“Copy/Paste is the mother of learning.”
“Repetition! Repetition is the mother of learning.”
Assumption: Working directory has 3 sub-folders named "data", "images", "code".
# #R Version
R.version.string
## [1] "R version 4.1.2 (2021-11-01)"
# #Working Directory
getwd()
## [1] "D:/Analytics/xADSM"
# #Version information about R, the OS and attached or loaded packages
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252 LC_MONETARY=English_India.1252
## [4] LC_NUMERIC=C LC_TIME=English_India.1252
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] qcc_2.7 moments_0.14 VIM_6.1.1 colorspace_2.0-2 mice_3.13.0
## [6] readxl_1.3.1 kableExtra_1.3.4 lubridate_1.8.0 Lahman_9.0-0 gapminder_0.3.0
## [11] nycflights13_1.0.2 gifski_1.4.3-1 data.table_1.14.2 forcats_0.5.1 stringr_1.4.0
## [16] dplyr_1.0.7 purrr_0.3.4 readr_2.1.0 tidyr_1.1.4 tibble_3.1.6
## [21] ggplot2_3.3.5 conflicted_1.0.4
##
## loaded via a namespace (and not attached):
## [1] webshot_0.5.2 httr_1.4.2 tools_4.1.2 backports_1.3.0 bslib_0.3.1
## [6] utf8_1.2.2 R6_2.5.1 DBI_1.1.1 nnet_7.3-16 withr_2.4.2
## [11] sp_1.4-6 tidyselect_1.1.1 compiler_4.1.2 rvest_1.0.2 xml2_1.3.2
## [16] bookdown_0.24 sass_0.4.0 scales_1.1.1 DEoptimR_1.0-9 lmtest_0.9-39
## [21] robustbase_0.93-9 proxy_0.4-26 systemfonts_1.0.3 digest_0.6.28 rmarkdown_2.11
## [26] svglite_2.0.0 pkgconfig_2.0.3 htmltools_0.5.2 fastmap_1.1.0 rlang_0.4.12
## [31] rstudioapi_0.13 jquerylib_0.1.4 generics_0.1.1 zoo_1.8-9 jsonlite_1.7.2
## [36] car_3.0-12 magrittr_2.0.1 Matrix_1.3-4 Rcpp_1.0.7 munsell_0.5.0
## [41] fansi_0.5.0 abind_1.4-5 lifecycle_1.0.1 stringi_1.7.5 yaml_2.2.1
## [46] carData_3.0-4 MASS_7.3-54 crayon_1.4.2 lattice_0.20-45 hms_1.1.1
## [51] knitr_1.36 pillar_1.6.4 ranger_0.13.1 boot_1.3-28 codetools_0.2-18
## [56] glue_1.5.0 evaluate_0.14 laeken_0.5.2 vcd_1.4-9 vctrs_0.3.8
## [61] tzdb_0.2.0 cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1 cachem_1.0.6
## [66] xfun_0.28 broom_0.7.10 e1071_1.7-9 class_7.3-19 viridisLite_0.4.0
## [71] ellipsis_0.3.2
# #Pandoc Version being used by RStudio
rmarkdown::pandoc_version()
## [1] '2.14.0.3'
I wanted a single document containing notes, code, and output for quick reference during the lectures. A combination of multiple file formats (docx, csv, xlsx, R, png etc.) was not working out for me, so I used the Bookdown package to generate this HTML file.
All of us had to stumble through some of the most common problems individually, and as we approach deeper topics, a more collaborative approach might be beneficial.
Further, the lectures are highly focused, so I had to explore some side topics in more detail to get the most benefit from them. I have included those topics here and would be interested in hearing about your experiences as well.
Towards that goal, I am sharing these notes, hoping that you will run the code in your own environment and raise any queries, problems, or differences in outcomes. Any suggestion or criticism is welcome. I have tried not to make any significant changes to your working environment; please let me know if you observe otherwise.
Currently, my priority is to stay in sync with the ongoing lectures. This time constraint has led to the issues given below; they will be corrected as and when possible.
Last but not least, I am also learning while creating this, so if you think I am wrong somewhere, please point it out. I am always open to suggestions.
Thank You all for the encouragement.
Shivam
R is case-sensitive, i.e. c() not C(), and View() not view().
The hash sign "#" comments out everything after it until the end of the line. There are no multiline comments.
The backslash "\" is reserved to escape the character that follows it.
The Escape key interrupts the parser, e.g. when the console shows a "+" prompt because R is waiting for more input before evaluating.
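A small sketch of the comment and escape rules (the Windows-style path string is purely illustrative):

```r
# '#' starts a comment; nothing after it on the line is evaluated
path <- "C:\\Users\\me"   # each "\\" is ONE escaped backslash character
nchar("\\")               # 1: the string holds a single backslash
cat(path, "\n")           # prints: C:\Users\me
```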
Overview
Execute the current expression in the Source Pane (top) with the 'Run' button or "Ctrl + Enter".
Execute the current expression in the Console Pane (bottom) with "Enter".
Windows 10 uses the backslash "\" in paths. R, however, uses the forward slash "/", because the backslash is the escape character in R.
In R Studio, Set Working Directory by:
# #Current Working Directory
getwd()
## [1] "D:/Analytics/xADSM"
#
# #R Installation Directory (Old DOS Convention i.e. ~1 after 6 letters)
R.home()
## [1] "C:/PROGRA~1/R/R-41~1.2"
Sys.getenv("R_HOME")
## [1] "C:/PROGRA~1/R/R-41~1.2"
#
# #This is Wrapped in IF Block to prevent accidental execution
if(FALSE){
# #WARNING: This will change your Working Directory
setwd("~")
}
If an R program is written at the console, line by line, the output is printed automatically, i.e. no function is needed for printing. This is called implicit printing.
Inside an R Script File, implicit printing does not work and the expression needs to be printed explicitly.
In R, the most common method to print the output ‘explicitly’ is by the function print().
# #Implicit Printing: This will NOT be printed to Console, if it is inside an R Script.
"Hello World!"
#
# #Implicit Printing using '()': Same as above
("Hello World!")
#
# #Explicit Printing using print() : To print Objects to Console, even inside an R Script.
print("Hello World!")
## [1] "Hello World!"
Everything that exists in R is an object, in the sense that it is a kind of data structure that can be manipulated. Expressions for evaluation are themselves objects; evaluation consists of taking the object representing an expression and returning the object that is the value of that expression.
# #ls(): List ALL Objects in the Current NameSpace (Environment)
ls()
## character(0)
Caution: Always use "<-" for assignment, NOT "=".
While "=" can be used for assignment, its use for assignment is highly discouraged because it may behave differently under certain subtle conditions that are difficult to debug. The convention is to use "=" only in function calls, for argument association (as a syntactic token).
There are 5 assignment operators (<-, =, <<-, ->, ->>); the others are not discussed here for now.
All the created objects are listed in the Environment Tab of the Top Right Pane.
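A minimal illustration of the subtle difference (assuming a fresh workspace with no object 'x'): inside a function call, "=" only names an argument, while "<-" also creates an object in the calling environment.

```r
# '=' inside a call only associates the argument; no object 'x' is created
median(x = 1:10)   # 5.5
exists("x")        # FALSE (assuming no prior 'x' in the workspace)
#
# '<-' inside a call assigns 'x' in the calling environment as a side effect
median(x <- 1:10)  # 5.5
exists("x")        # TRUE now
```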
# #Assignment Operator "<-" is used to assign any value (ex: 10) to any object (ex: 'bb')
bb <- 10
#
# #Print Object
print(bb)
## [1] 10
In the Environment Tab, any object can be selected and deleted using the Brush.
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] 10
#
# #Remove an Object
rm(bb)
#
# #Equivalent
if(FALSE) {rm("bb")} #Same
if(FALSE) {rm(list = "bb")} #Faster, verbose, and would not work without quotes
#
# #Trying to Print an Object 'bb' and Handling the Error, if thrown
tryCatch(print(bb), error = function(e) print(paste0(e)))
## [1] "Error in print(bb): object 'bb' not found\n"
5.1 Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
5.2 Elements are the entities on which data are collected. (Generally ROWS)
5.3 A variable is a characteristic of interest for the elements. (Generally COLUMNS)
5.4 The set of measurements obtained for a particular element is called an observation.
5.5 Statistics is the art and science of collecting, analysing, presenting, and interpreting data.
R has 6 basic data types (logical, integer, double, character, complex, and raw). These data types can be combined to form data structures (vector, list, matrix, dataframe, factor etc.). Refer to What is a Vector!
Atomic vectors are homogeneous, i.e. each component has the same datatype. A vector's type can be checked with the typeof() or class() function. Its length, i.e. the number of elements in the vector, can be checked with the function length().
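A quick sketch of these inspection functions (the vector is illustrative); note that typeof() reports the internal storage type while class() reports the S3 class:

```r
v <- c(1.5, 2.5, 3.5)
typeof(v)   # "double"  -- internal storage type
class(v)    # "numeric" -- the class seen by S3 methods
length(v)   # 3
```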
If the output of an expression does not show numbers in brackets like '[1]', then it is a 'NULL'-type return. [Numbers] show that it is a vector. Ex: str() and cat() outputs are of NULL type.
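For example, cat() prints its text as a side effect but returns NULL, so no '[1]' index appears in its output:

```r
x <- cat("Hello\n")  # prints Hello, with no "[1]" prefix
is.null(x)           # TRUE: cat() returns NULL
```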
Use function c() to create a vector (or a list) -
NULL < raw < logical < integer < double < complex < character < list < expression.
Caution: The colon ":" might produce a vector of unexpected length (in the 0-length case). Suggestion: Use the colon only with hardcoded numbers, i.e. "1:10" is fine, but "1:n" is dangerous and should be avoided.
Caution: The seq() function might produce a vector of unexpected type (in the 1-length case). Suggestion: Use seq_along() or seq_len() instead.
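A sketch of both traps:

```r
n <- 0L
1:n            # counts DOWN: c(1L, 0L) -- length 2, not an empty vector!
seq_len(n)     # integer(0) -- the correct empty vector for n == 0
#
typeof(seq(1, 5, 1))   # "double" -- not integer, despite whole-number values
typeof(seq_len(5))     # "integer"
```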
# #To know about an Object: str(), class(), length(), dim(), typeof(), is(), attributes(), names()
# #Integer: To declare as integer "L" (NOT "l") is needed
ii_int <- c(1L, 2L, 3L, 4L, 5L)
str(ii_int)
## int [1:5] 1 2 3 4 5
#
# #Double (& Default)
dd_dbl <- c(1, 2, 3, 4, 5)
str(dd_dbl)
## num [1:5] 1 2 3 4 5
#
# #Character
cc_chr <- c('a', 'b', 'c', 'd', 'e')
str(cc_chr)
## chr [1:5] "a" "b" "c" "d" "e"
#
# #Logical
ll_lgl <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
str(ll_lgl)
## logi [1:5] TRUE FALSE FALSE TRUE TRUE
# #Integer Vector of Length 1
nn <- 5L
#
# #Colon ":" Operator - Avoid its usage
str(c(1:nn))
## int [1:5] 1 2 3 4 5
c(typeof(pi:6), typeof(6:pi))
## [1] "double" "integer"
#
# #seq() - Avoid its usage
str(seq(1, nn))
## int [1:5] 1 2 3 4 5
str(seq(1, nn, 1))
## num [1:5] 1 2 3 4 5
str(seq(1, nn, 1L))
## num [1:5] 1 2 3 4 5
str(seq(1L, nn, 1L))
## int [1:5] 1 2 3 4 5
#
# #seq_len()
str(seq_len(nn))
## int [1:5] 1 2 3 4 5
str(seq(1, 5, 1))
## num [1:5] 1 2 3 4 5
str(letters[1:5])
## chr [1:5] "a" "b" "c" "d" "e"
str(1:5 %% 2 == 0)
## logi [1:5] FALSE TRUE FALSE TRUE FALSE
# #Create Two Vectors
income <- c(100, 200, 300, 400, 500)
gender <- c("male", "female", "female", "female", "male")
#
# #Create a DataFrame
bb <- data.frame(income, gender)
#
# #Print or View DataFrame
#View(bb)
print(bb)
## income gender
## 1 100 male
## 2 200 female
## 3 300 female
## 4 400 female
## 5 500 male
#
# #Structure
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
#
# #Names
names(bb)
## [1] "income" "gender"
R Script file extension is ".R"
"Ctrl + S" opens the Save window at the Working Directory.
"Ctrl + O" opens the Browse window at the Working Directory.
# #Subdirectory "data" has data files like .csv .rds .txt .xlsx
# #Subdirectory "code" has scripts files like .R
# #Subdirectory "images" has images like .png
#
# #Check if a File exists
path_relative <- "data/aa.xlsx" #Relative Path
#
if(file.exists(path_relative)) {
cat("File Exists\n")
} else {
cat(paste0("File does not exist at ", getwd(), "/", path_relative, "\n"))
}
## File Exists
#
if(exists("XL", envir = .z)) {
cat(paste0("Absolute Path exists as: ", .z$XL, "\n"))
path_absolute <- paste0(.z$XL, "aa", ".xlsx") #Absolute Path
#
if(file.exists(path_absolute)) {
cat("File Exists\n")
} else {
cat(paste0("File does not exist at ", path_absolute, "\n"))
}
} else {
cat(paste0("Object 'XL' inside Hidden Environment '.z' does not exist. \n",
"It is probably File Path of the Author, Replace the File Path from Your own Directory\n"))
}
## Absolute Path exists as: D:/Analytics/xADSM/data/
## File Exists
The write.csv() and read.csv() combination can be used to export data and import it back into R. But it has some limitations -
xx_data <- bb
#
# #Write a dataframe to a CSV File
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from the CSV into a dataframe
yy_data <- read.csv("data/B09_xx_data.csv")
#
# #Check if the object being read is the same as the object that was written
identical(xx_data, yy_data)
## [1] FALSE
# #Exercise to show how to match the objects being imported /exported from CSV
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
xx_data <- bb
# #Write to CSV
write.csv(xx_data, "data/B09_xx_data.csv")
#
# #Read from CSV by providing row.names Column and colClasses()
yy_data <- read.csv("data/B09_xx_data.csv", row.names = 1,
colClasses = c('character', 'numeric', 'character'))
#
# #Coerce row.names attribute to integer
attr(yy_data, "row.names") <- as.integer(attr(yy_data, "row.names"))
#
# #Check if the objects are identical
identical(xx_data, yy_data)
## [1] TRUE
stopifnot(identical(xx_data, yy_data))
str(bb)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
xx_data <- bb
#
# #Save the Object as RDS File
saveRDS(xx_data, "data/B09_xx_data.rds")
#
# #Read from the RDS File
yy_data <- readRDS("data/B09_xx_data.rds")
#
# #Objects are identical (No additional transformations are needed)
identical(xx_data, yy_data)
## [1] TRUE
str(xx_data)
## 'data.frame': 5 obs. of 2 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
# #Adding a Column to a dataframe
xx_data <- data.frame(xx_data, age = 22:26)
#
# #Adding a Column to a dataframe by adding a Vector
x_age <- 22:26
xx_data <- data.frame(xx_data, x_age)
str(xx_data)
## 'data.frame': 5 obs. of 4 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ age : int 22 23 24 25 26
## $ x_age : int 22 23 24 25 26
#
# #Adding a Column to a dataframe by using dollar "$"
xx_data$age1 <- x_age
#
# #Adding a Blank Column using NA
xx_data$blank <- NA
#
# #Editing of a dataframe can also be done
# edit(xx_data)
str(xx_data)
## 'data.frame': 5 obs. of 6 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ age : int 22 23 24 25 26
## $ x_age : int 22 23 24 25 26
## $ age1 : int 22 23 24 25 26
## $ blank : logi NA NA NA NA NA
#
# #Removing a Column by subsetting
xx_data <- xx_data[ , -c(3)]
#
# #Removing a Column using NULL
xx_data$age1 <- NULL
str(xx_data)
## 'data.frame': 5 obs. of 4 variables:
## $ income: num 100 200 300 400 500
## $ gender: chr "male" "female" "female" "female" ...
## $ x_age : int 22 23 24 25 26
## $ blank : logi NA NA NA NA NA
Packages include reusable functions, the documentation that describes how to use them, and sample data.
In R Studio: Packages Tab | Install | Package Name = “psych” | Install
if(FALSE){
# #WARNING: This will install packages and R Studio will NOT work for that duration
# #Install Packages and their dependencies
install.packages("psych", dependencies = TRUE)
}
# #Load a Package with or without Quotes
library(readxl)
library("readr")
# #Load Multiple Packages
pkg_chr <- c("ggplot2", "tibble", "tidyr", "readr", "dplyr")
#lapply(pkg_chr, FUN = function(x) {library(x, character.only = TRUE)})
#
# #Load Multiple Packages, Suppress Startup Messages, and No console output
invisible(lapply(pkg_chr, FUN = function(x) {
suppressMessages(library(x, character.only = TRUE))}))
# #Detach a package
#detach("package:psych", unload = TRUE)
#
# #Search Package in the already loaded packages
pkg_chr <- "psych"
if (pkg_chr %in% .packages()) {
# #Detach a package that has been loaded previously
detach(paste0("package:", pkg_chr), character.only = TRUE, unload = TRUE)
}
To import Excel in RStudio: Environment | Dropdown | From Excel | Browse
The object imported by read.csv(), i.e. 'mydata', is NOT the same as the one imported by read_excel(), i.e. 'mydata_xl'.
All of these objects can be converted into any other form as needed, i.e. dataframe to tibble or vice versa.
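A sketch of converting between the two forms (requires the tibble package; the column values are illustrative):

```r
library(tibble)
df <- data.frame(income = c(100, 200), gender = c("male", "female"))
tb <- as_tibble(df)        # data.frame -> tibble
df2 <- as.data.frame(tb)   # tibble -> data.frame
class(tb)                  # "tbl_df" "tbl" "data.frame"
class(df2)                 # "data.frame"
```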
# #Data File Name has been modified to include lecture number "B09"
# #All Data Files are in the sub-directory named 'data'
mydata <- read.csv("data/B09-FLIGHTS.csv")
#
# #To Copy from Clipboard, assuming copied from xlsx i.e. tab separated data
mydata_clip <- read.csv("clipboard", sep = '\t', header = TRUE)
# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is in terms of faster loading time and lower memory requirement
xx_csv <- paste0("data/", "B09-FLIGHTS",".csv")
xx_rds <- paste0("data/", "b09_flights", ".rds")
b09_flights <- NULL
if(file.exists(xx_rds)) {
b09_flights <- readRDS(xx_rds)
} else {
# #Read CSV
b09_flights <- read.csv(xx_csv)
# #Write Object as RDS
saveRDS(b09_flights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata <- b09_flights
str(mydata)
## 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : int 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : int 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : int 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : int 1400 1416 1089 1576 762 719 1065 229 944 733 ...
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : int 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : chr "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(mydata)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight
## 1 2013 1 1 517 515 2 830 819 11 UA 1545
## 2 2013 1 1 533 529 4 850 830 20 UA 1714
## 3 2013 1 1 542 540 2 923 850 33 AA 1141
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725
## 5 2013 1 1 554 600 -6 812 837 -25 DL 461
## 6 2013 1 1 554 558 -4 740 728 12 UA 1696
## tailnum origin dest air_time distance hour minute time_hour
## 1 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
## 2 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
## 3 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
## 4 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
## 5 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
## 6 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
tail(mydata)
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## 336771 2013 9 30 NA 1842 NA NA 2019 NA EV
## 336772 2013 9 30 NA 1455 NA NA 1634 NA 9E
## 336773 2013 9 30 NA 2200 NA NA 2312 NA 9E
## 336774 2013 9 30 NA 1210 NA NA 1330 NA MQ
## 336775 2013 9 30 NA 1159 NA NA 1344 NA MQ
## 336776 2013 9 30 NA 840 NA NA 1020 NA MQ
## flight tailnum origin dest air_time distance hour minute time_hour
## 336771 5274 N740EV LGA BNA NA 764 18 42 2013-09-30 18:00:00
## 336772 3393 <NA> JFK DCA NA 213 14 55 2013-09-30 14:00:00
## 336773 3525 <NA> LGA SYR NA 198 22 0 2013-09-30 22:00:00
## 336774 3461 N535MQ LGA BNA NA 764 12 10 2013-09-30 12:00:00
## 336775 3572 N511MQ LGA CLE NA 419 11 59 2013-09-30 11:00:00
## 336776 3531 N839MQ LGA RDU NA 431 8 40 2013-09-30 08:00:00
# #library(readxl)
mydata_xl <- read_excel("data/B09-FLIGHTS.xlsx", sheet = "FLIGHTS")
# #library(readxl)
xx_xl <- paste0("data/", "B09-FLIGHTS",".xlsx")
xx_rds_xl <- paste0("data/", "b09_flights_xls", ".rds")
b09_flights_xls <- NULL
if(file.exists(xx_rds_xl)) {
b09_flights_xls <- readRDS(xx_rds_xl)
} else {
b09_flights_xls <- read_excel(xx_xl, sheet = "FLIGHTS")
saveRDS(b09_flights_xls, xx_rds_xl)
}
rm(xx_xl, xx_rds_xl)
mydata_xl <- b09_flights_xls
#str(mydata_xl)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:336776] 2013 2013 2013 2013 2013 ...
## $ month : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : chr [1:336776] "517" "533" "542" "544" ...
## $ sched_dep_time: num [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : chr [1:336776] "2" "4" "2" "-1" ...
## $ arr_time : chr [1:336776] "830" "850" "923" "1004" ...
## $ sched_arr_time: num [1:336776] 819 830 850 1022 837 ...
## $ arr_delay : chr [1:336776] "11" "20" "33" "-18" ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : num [1:336776] 1545 1714 1141 725 461 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : chr [1:336776] "227" "227" "160" "183" ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Following Setup allows us to read CSV only once and then create an RDS file
# #Its advantage is in terms of faster loading time and lower memory requirement
# #library(readr)
xx_csv <- paste0("data/", "B09-FLIGHTS",".csv")
xx_rds <- paste0("data/", "xxflights", ".rds")
xxflights <- NULL
if(file.exists(xx_rds)) {
xxflights <- readRDS(xx_rds)
} else {
xxflights <- read_csv(xx_csv, show_col_types = FALSE)
attr(xxflights, "spec") <- NULL
attr(xxflights, "problems") <- NULL
saveRDS(xxflights, xx_rds)
}
rm(xx_csv, xx_rds)
mydata_rdr <- xxflights
# #Subset All Rows and last 3 columns
data6 <- mydata[ , c(17:19)]
str(data6)
## 'data.frame': 336776 obs. of 3 variables:
## $ hour : int 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : int 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour: chr "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# #Subset by deleting the 1:16 columns
data7 <- mydata[ , -c(1:16)]
stopifnot(identical(data6, data7))
Caution: Attaching a dataset should be avoided, to prevent unexpected behaviour due to 'masking'. Using full scope resolution, i.e. 'data_frame$column_header', results in fewer bugs. However, if a dataset has been attached, please ensure that it is also detached.
if(FALSE){
# #WARNING: Attaching a Dataset is discouraged because of 'masking'
# #'dep_time' is Column Header of a dataframe 'mydata'
tryCatch(str(dep_time), error = function(e) print(paste0(e)))
## [1] "Error in str(dep_time): object 'dep_time' not found\n"
# #Attach the Dataset
attach(mydata)
# #Now all the column headers are accessible without the $ sign
str(dep_time)
## int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
# #But, there are other datasets also, attaching another one results in MESSAGE
attach(mydata_xl)
## The following objects are masked from mydata:
##
## air_time, arr_delay, arr_time, carrier, day, dep_delay, dep_time, dest,
## distance, flight, hour, minute, month, origin, sched_arr_time,
## sched_dep_time, tailnum, time_hour, year
str(dep_time)
## chr [1:336776] "517" "533" "542" "544" "554" "554" "555" "557" "557" "558" "558" ...
#
# #'mydata_xl$dep_time' masked the already present 'mydata$dep_time'.
# #Thus now it is showing as 'chr' in place of original 'int'
# #Column Header Names can be highly varied and will silently mask other variables
# #Hence, attaching a dataset would result in random bugs or unexpected behaviours
#
# #Detach a Dataset
detach(mydata_xl)
detach(mydata)
}
Figure 1.1 Correlation using psych::pairs.panels()
# # Subset 3 Columns and 1,00,000 rows
x_rows <- 100000L
data_pairs <- mydata[1:x_rows, c(7, 16, 9)]
#
# #Equivalent
data_pairs <- mydata %>%
select(air_time, distance, arr_delay) %>%
slice_head(n = x_rows)
#
if( nrow(data_pairs) * ncol(data_pairs) > 1000000 ) {
print("Please reduce the number of points to a sane number!")
ggplot()
} else {
#B09P01
pairs.panels(data_pairs)
}
These allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.
An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code. To know more, go to RStudio.
To know more about Google Colab, go to Google Colab.
NOTE: As I am not using Google Colab, the workflow explained between 00:00 and 35:10 is NOT covered here. If someone using Google Colab is willing to share their notes, I will include them.
Base R graphs/plots are shown in Figure 2.1
# #Flights: Arrival Time (Y) vs. Departure (X) Time (Arrival Time is linked to Departure Time)
plot(mydata$dep_time, mydata$arr_time)
Figure 2.1 Flights: Arrival Time (Y) vs. Departure (X) Time
# #Create a Subset of Dataframe of 1000 Rows for quick calculations
bb <- head(mydata, 1000)
#
# #Dimensions: dim() Row x Column; nrow(); ncol()
dim(bb)
## [1] 1000 19
#
stopifnot(identical(nrow(bb), dim(bb)[1]))
stopifnot(identical(ncol(bb), dim(bb)[2]))
# #Split a Dataframe by subsetting
data_1 <- bb[ ,1:8]
data_2 <- bb[ ,9:19]
# str(data_1)
# #Merge a Dataframe by cbind()
data_3 <- cbind(data_1, data_2)
# #Equivalent
data_4 <- data.frame(data_1, data_2)
# str(bb_3)
stopifnot(identical(data_3, data_4))
# #Row Split
data_5 <- bb[1:300, ]
data_6 <- bb[301:1000, ]
#
# #Equivalent
n_rows <- 300L
data_5 <- bb[1:n_rows, ]
data_6 <- bb[(n_rows+1L):nrow(bb), ]
#
stopifnot(identical(data_5, head(bb, n_rows)))
stopifnot(identical(data_6, tail(bb, (nrow(bb)-n_rows))))
# #Merge a Dataframe by rbind()
data_7 <- rbind(data_5, data_6)
stopifnot(identical(bb, data_7))
# #Change A Specific Name based on Index Ex: First Header "year" -> "YEAR"
# #NOTE: Output of 'names(bb)' is a character vector, not a dataframe
# #So, [1] is being used to subset for 1st element and NOT the [ , 1] (as done for dataframe)
(names(bb)[1] <- "YEAR")
## [1] "YEAR"
#
# #Change all Column Headers to Uppercase by toupper() or Lowercase by tolower()
names(bb) <- toupper(names(bb))
NA can be coerced to any other vector type except raw. There are also typed constants like NA_integer_, NA_real_ etc. For checking only the presence of NA, anyNA() is faster than is.na().
Overview of ‘Not Available’
To remove all NA
bb <- xxflights
# #anyNA() is faster than is.na()
if(anyNA(bb)) print("NA are Present!") else print("NA not found")
## [1] "NA are Present!"
#
# #Columnwise NA Count
bb_na_col <- colSums(is.na(bb))
#
# #Vector of Columns having NA
which(bb_na_col != 0)
## dep_time dep_delay arr_time arr_delay tailnum air_time
## 4 6 7 9 12 15
stopifnot(identical(which(bb_na_col != 0), which(vapply(bb, anyNA, logical(1)))))
#
# #Indices of Rows with NA
head(which(!complete.cases(bb)))
## [1] 472 478 616 644 726 734
#
# #How many rows contain NA
sum(!complete.cases(bb))
## [1] 9430
#
# #How many rows have NA in specific Columns
sum(!complete.cases(bb[, c(6, 9, 4)]))
## [1] 9430
# #Remove all rows which have any NA
# #na.omit(), complete.cases(), tidyr::drop_na(), rowSums(is.na())
bb_1 <- na.omit(bb)
# #Print the Count of removed rows containing NA
print(paste0("Note: ", length(attributes(bb_1)$na.action), " rows removed."))
## [1] "Note: 9430 rows removed."
#
# #Remove additional Attribute added by na.omit()
attr(bb_1, "na.action") <- NULL
#
# #Equivalent
bb_2 <- bb[complete.cases(bb), ]
bb_3 <- bb %>% drop_na()
bb_4 <- bb[rowSums(is.na(bb)) == 0, ]
#Validation
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
#
# #complete.cases also allow partial selection of specific columns
# #Remove rows which have NA in some columns i.e. ignore NA in other columns
dim(bb[complete.cases(bb[ , c(6, 9, 4)]), ])
## [1] 327346 19
# #Equivalent
dim(bb %>% drop_na(dep_delay, arr_delay, dep_time))
## [1] 327346 19
#
# #Remove rows which have more than allowed number of NA (ex:4) in any column
# #Caution: In general, this is not recommended because random columns retain NA
dim(bb[rowSums(is.na(bb)) <= 4L, ])
## [1] 328521 19
Sources: Grouping Functions and the Apply Family, Why is vapply safer than sapply, Hadley - Advanced R - Functionals, This, This, & This
The apply functions in R are designed to avoid explicit use of loop constructs.
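A toy contrast between an explicit loop and its vapply() equivalent (the list m is purely illustrative):

```r
m <- list(a = 1:3, b = 4:6)
#
# Explicit loop: pre-allocate, then fill element by element
out <- numeric(length(m))
for (i in seq_along(m)) out[i] <- mean(m[[i]])
out                           # 2 5
#
# Same result without the loop: named and type-checked
vapply(m, mean, numeric(1))   # a = 2, b = 5
```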
# #Subset Dataframe
bb <- xxflights
data_8 <- bb[ , c("dep_delay", "arr_delay", "dep_time")]
#data_8 <- bb %>% select(dep_delay, arr_delay, dep_time)
#
# #Remove missing values
data_9 <- na.omit(data_8)
#
# #Calculate Columnwise Mean
(bb_1 <- apply(data_9, 2, mean))
## dep_delay arr_delay dep_time
## 12.555156 6.895377 1348.789883
bb_2 <- unlist(lapply(data_9, mean))
bb_3 <- sapply(data_9, mean)
bb_4 <- vapply(data_9, mean, numeric(1))
#
stopifnot(all(identical(bb_1, bb_2), identical(bb_1, bb_3), identical(bb_1, bb_4)))
Refer to The 6 Datatypes of Atomic Vectors
Create a basic tibble, Table 2.1, for evaluating the 'is.x()' series of functions in Base R
| ii | dd | cc | ll | ff | fo | dtm | dat |
|---|---|---|---|---|---|---|---|
| 1 | 1 | a | FALSE | odd | odd | 2021-12-11 01:14:57 | 2021-12-12 |
| 2 | 2 | b | TRUE | even | even | 2021-12-11 01:14:58 | 2021-12-13 |
| 3 | 3 | c | FALSE | odd | odd | 2021-12-11 01:14:59 | 2021-12-14 |
| 4 | 4 | d | TRUE | even | even | 2021-12-11 01:15:00 | 2021-12-15 |
| 5 | 5 | e | FALSE | odd | odd | 2021-12-11 01:15:01 | 2021-12-16 |
| 6 | 6 | f | TRUE | even | even | 2021-12-11 01:15:02 | 2021-12-17 |
# #Validation
# #anyNA() is TRUE if there is an NA present, FALSE otherwise
vapply(bb, anyNA, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.atomic() is TRUE for All Atomic Vectors, factor, matrix but NOT for list
vapply(bb, is.atomic, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#
# #is.vector() is TRUE for All Atomic Vectors, list but NOT for factor, matrix, DATE & POSIXct
# #CAUTION: With vapply() it returns TRUE for matrix (it checks individual elements)
# #CAUTION: FALSE if the vector has attributes (except names) ex: DATE & POSIXct
vapply(bb, is.vector, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#
# #is.numeric() is TRUE for both integer and double
vapply(bb, is.numeric, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.integer() is TRUE only for integer
vapply(bb, is.integer, logical(1))
## ii dd cc ll ff fo dtm dat
## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#
# #is.double() is TRUE only for double
vapply(bb, is.double, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE
#
# #is.character() is TRUE only for character
vapply(bb, is.character, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#
# #is.logical() is TRUE only for logical
vapply(bb, is.logical, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# #Factors
# #is.factor() is TRUE only for factor (unordered or ordered)
vapply(bb, is.factor, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#
# #is.ordered() is TRUE only for ordered factor
vapply(bb, is.ordered, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
#
# #nlevels()
vapply(bb, nlevels, integer(1))
## ii dd cc ll ff fo dtm dat
## 0 0 0 0 2 2 0 0
#
# #levels()
vapply(bb, function(x) !is.null(levels(x)), logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
#
# #table()
table(bb$ff)
##
## even odd
## 3 3
# #Package lubridate covers the missing functions for POSIXct, POSIXlt, or Date
# #is.timepoint() is TRUE for POSIXct, POSIXlt, or Date
vapply(bb, is.timepoint, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
#
# #is.POSIXt() is TRUE for POSIXct or POSIXlt, but NOT for Date
vapply(bb, is.POSIXt, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
#
# #is.Date() is TRUE only for Date
vapply(bb, is.Date, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
# #Which Columns have Duplicate Values
vapply(bb, function(x) anyDuplicated(x) != 0L, logical(1))
## ii dd cc ll ff fo dtm dat
## FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
Factors store integer codes with character labels. They are useful for columns which have a limited number of unique values, like "Male, Female" and "True, False" etc. They are useful in data analysis for statistical modelling.
A factor is essentially an integer representation of a character vector: the values are stored as integer codes, together with an associated set of levels.
as.factor() vs. factor()
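The distinction can be sketched on a small toy vector (illustrative values, not from the dataset): as.factor() simply takes the sorted unique values as levels, while factor() lets you specify the level order (and labels); either way, the data are stored as integer codes.

```r
# #Minimal sketch (toy vector): factor internals and as.factor() vs factor()
x <- c("b", "a", "b")
f1 <- as.factor(x)                     # levels are the sorted unique values: "a", "b"
f2 <- factor(x, levels = c("b", "a"))  # explicit level order
levels(f1)                             # "a" "b"
levels(f2)                             # "b" "a"
typeof(f1)                             # "integer" - a factor stores integer codes
as.integer(f1)                         # 2 1 2 (codes into levels(f1))
as.integer(f2)                         # 1 2 1 (same data, different codes)
```

The same data can therefore carry different integer codes depending on the level order, which matters for sorting and modelling contrasts.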
str(bb$ll)
## logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
# #Coercion to Factor
bb$new <- as.factor(bb$ll)
str(bb$new)
## Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
#
# #table()
table(bb$ll)
##
## FALSE TRUE
## 3 3
table(bb$new)
##
## FALSE TRUE
## 3 3
#
# #Levels can be Labelled differently also
str(bb$ff)
## Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
# #
str(factor(bb$ff, levels = c("even", "odd"), labels = c("day", "night")))
## Factor w/ 2 levels "day","night": 2 1 2 1 2 1
str(factor(bb$ff, levels = c("odd", "even"), labels = c("day", "night")))
## Factor w/ 2 levels "day","night": 1 2 1 2 1 2
#
# #Coercion from Factor to character, logical etc.
bb$xcc <- as.character(bb$new)
bb$xll <- as.logical(bb$new)
#
str(bb)
## tibble [6 x 11] (S3: tbl_df/tbl/data.frame)
## $ ii : int [1:6] 1 2 3 4 5 6
## $ dd : num [1:6] 1 2 3 4 5 6
## $ cc : chr [1:6] "a" "b" "c" "d" ...
## $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
## $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
## $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
## $ dtm: POSIXct[1:6], format: "2021-12-11 01:14:57" "2021-12-11 01:14:58" "2021-12-11 01:14:59" ...
## $ dat: Date[1:6], format: "2021-12-12" "2021-12-13" "2021-12-14" ...
## $ new: Factor w/ 2 levels "FALSE","TRUE": 1 2 1 2 1 2
## $ xcc: chr [1:6] "FALSE" "TRUE" "FALSE" "TRUE" ...
## $ xll: logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
bb <- xxflights
aa <- c("month", "day")
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
## $ month: num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : num [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
# #To factor
bb$day <- as.factor(bb$day)
bb$month <- as.factor(bb$month)
# #Equivalent
#bb[aa] <- lapply(bb[aa], as.factor)
str(bb[aa])
## tibble [336,776 x 2] (S3: tbl_df/tbl/data.frame)
## $ month: Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ day : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
Caution: The only thing you need to take care of is that you do not give two list elements the same name. R will NOT throw an ERROR.
Due to the resulting two-dimensional structure, data frames can mimic some of the behaviour of matrices. You can select rows and do operations on rows. You cannot do that with lists, as a row is undefined there.
A data frame is intended to be used as a relational table. This means that elements in the same column are related to each other in the sense that they are all measures of the same metric, and elements in the same row are related to each other in the sense that they are all measures from the same observation or item. This is why, when you look at the structure of a data frame, it states the number of observations and the number of variables instead of the number of rows and columns.
Data frames are distinct from matrices because they can hold heterogeneous data types across columns/variables. Data frames do not permit multiple data types within a column/variable, for reasons that also follow from the relational-table idea.
All this implies that you should use a data frame for any dataset that fits that two-dimensional structure, i.e. any dataset where a column coincides with a variable and a row coincides with a single observation in the broad sense of the word. For all other structures, lists are the way to go.
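The row-versus-no-row distinction above can be sketched with a toy data frame and list (illustrative names, not from the dataset):

```r
# #Toy objects (illustrative): a data frame has rows, a plain list does not
df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
lst <- list(x = 1:3, y = c("a", "b", "c"))
#
# #Row selection works on the data frame: one observation across all variables
df[2, ]       # x = 2, y = "b"
nrow(df)      # 3
#
# #A list has no rows: dim() is NULL, and lst[2, ] raises an error
dim(lst)      # NULL
names(lst)    # "x" "y"
```

Column access (`df$x`, `lst$x`) is the same for both; only the row-wise operations set data frames apart.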
# #CAUTION: Don't Create a list with duplicate names (R will NOT throw ERROR)
bb <- list(a=1, b=2, a=3)
# # 3rd index can't be accessed using $
bb$a
## [1] 1
identical(bb$a, bb[[1]])
## [1] TRUE
identical(bb$a, bb[[3]])
## [1] FALSE
bb[[3]]
## [1] 3
# #Create a list
bb_lst <- list( a = c(1, 2), b = c('a', 'b', 'c'))
tryCatch(
# #Trying to create varying length of variables in dataframe like in list
bb_dft <- data.frame(a = c(1,2), b = c('a', 'b', 'c')),
error = function(e) {
# #Print ERROR
cat(paste0(e))
# #Double Arrow Assignment '<<-' to assign in parent environment
bb_dft <<- data.frame(a = c(1, 2), b = c('a', 'b'))
}
)
## Error in data.frame(a = c(1, 2), b = c("a", "b", "c")): arguments imply differing number of rows: 2, 3
#
# #Both list and dataframe have same type()
typeof(bb_lst)
## [1] "list"
typeof(bb_dft)
## [1] "list"
#
# #But, class() is different for list and dataframe
class(bb_lst)
## [1] "list"
class(bb_dft)
## [1] "data.frame"
#
str(bb_lst)
## List of 2
## $ a: num [1:2] 1 2
## $ b: chr [1:3] "a" "b" "c"
str(bb_dft)
## 'data.frame': 2 obs. of 2 variables:
## $ a: num 1 2
## $ b: chr "a" "b"
#
# #Although 'bb_lst_c' is a list, coercion takes place inside c() i.e. '9' becomes character
bb_lst_c <- list( a = c(8, 'x'), b = c('y', 9))
str(bb_lst_c[[2]][2])
## chr "9"
#
# #Here, '9' stays numeric because it is stored as a list element; note the extra [[ ]]
bb_lst_l <- list( a = list(8, 'x'), b = list('y', 9))
str(bb_lst_l[[2]][[2]])
## num 9
# #Create a Matrix
bb_mat <- matrix(1:6, nrow = 2, ncol = 3)
print(bb_mat)
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
str(bb_mat)
## int [1:2, 1:3] 1 2 3 4 5 6
class(bb_mat)
## [1] "matrix" "array"
typeof(bb_mat)
## [1] "integer"
# #Basic Tibble
bb <- xxbasic10
str(bb)
## tibble [6 x 8] (S3: tbl_df/tbl/data.frame)
## $ ii : int [1:6] 1 2 3 4 5 6
## $ dd : num [1:6] 1 2 3 4 5 6
## $ cc : chr [1:6] "a" "b" "c" "d" ...
## $ ll : logi [1:6] FALSE TRUE FALSE TRUE FALSE TRUE
## $ ff : Factor w/ 2 levels "even","odd": 2 1 2 1 2 1
## $ fo : Ord.factor w/ 2 levels "even"<"odd": 2 1 2 1 2 1
## $ dtm: POSIXct[1:6], format: "2021-12-11 01:14:57" "2021-12-11 01:14:58" "2021-12-11 01:14:59" ...
## $ dat: Date[1:6], format: "2021-12-12" "2021-12-13" "2021-12-14" ...
# #Split with 'cc' as common ID column
bb_a <- bb[1:3]
bb_b <- bb[3:ncol(bb)]
#
# #merge() using the common ID column 'cc'
bb_new <- merge(bb_a, bb_b, by = "cc")
bb_new
## cc ii dd ll ff fo dtm dat
## 1 a 1 1 FALSE odd odd 2021-12-11 01:14:57 2021-12-12
## 2 b 2 2 TRUE even even 2021-12-11 01:14:58 2021-12-13
## 3 c 3 3 FALSE odd odd 2021-12-11 01:14:59 2021-12-14
## 4 d 4 4 TRUE even even 2021-12-11 01:15:00 2021-12-15
## 5 e 5 5 FALSE odd odd 2021-12-11 01:15:01 2021-12-16
## 6 f 6 6 TRUE even even 2021-12-11 01:15:02 2021-12-17
bb <- xxflights
# #Sort ascending (default)
bb_1 <- bb[order(bb$dep_delay), ]
# #Sort descending
bb_2 <- bb[order(-bb$dep_delay), ]
#
bb[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 517 2 N14228 UA
## 2 533 4 N24211 UA
## 3 542 2 N619AA AA
## 4 544 -1 N804JB B6
## 5 554 -6 N668DN DL
bb_1[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 2040 -43 N592JB B6
## 2 2022 -33 N612DL DL
## 3 1408 -32 N825AS EV
## 4 1900 -30 N934DL DL
## 5 1703 -27 N208FR F9
bb_2[1:5, c("dep_time", "dep_delay", "tailnum", "carrier")]
## # A tibble: 5 x 4
## dep_time dep_delay tailnum carrier
## <dbl> <dbl> <chr> <chr>
## 1 641 1301 N384HA HA
## 2 1432 1137 N504MQ MQ
## 3 1121 1126 N517MQ MQ
## 4 1139 1014 N338AA AA
## 5 845 1005 N665MQ MQ
bb <- xxbasic10
bb
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 1 1 a FALSE odd odd 2021-12-11 01:14:57 2021-12-12
## 2 2 2 b TRUE even even 2021-12-11 01:14:58 2021-12-13
## 3 3 3 c FALSE odd odd 2021-12-11 01:14:59 2021-12-14
## 4 4 4 d TRUE even even 2021-12-11 01:15:00 2021-12-15
## 5 5 5 e FALSE odd odd 2021-12-11 01:15:01 2021-12-16
## 6 6 6 f TRUE even even 2021-12-11 01:15:02 2021-12-17
# #Sort ascending (default)
(bb_1 <- bb[order(bb$ll), ])
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 1 1 a FALSE odd odd 2021-12-11 01:14:57 2021-12-12
## 2 3 3 c FALSE odd odd 2021-12-11 01:14:59 2021-12-14
## 3 5 5 e FALSE odd odd 2021-12-11 01:15:01 2021-12-16
## 4 2 2 b TRUE even even 2021-12-11 01:14:58 2021-12-13
## 5 4 4 d TRUE even even 2021-12-11 01:15:00 2021-12-15
## 6 6 6 f TRUE even even 2021-12-11 01:15:02 2021-12-17
# #Sort on Multiple Columns with ascending and descending
(bb_2 <- bb[order(bb$ll, -bb$dd), ])
## # A tibble: 6 x 8
## ii dd cc ll ff fo dtm dat
## <int> <dbl> <chr> <lgl> <fct> <ord> <dttm> <date>
## 1 5 5 e FALSE odd odd 2021-12-11 01:15:01 2021-12-16
## 2 3 3 c FALSE odd odd 2021-12-11 01:14:59 2021-12-14
## 3 1 1 a FALSE odd odd 2021-12-11 01:14:57 2021-12-12
## 4 6 6 f TRUE even even 2021-12-11 01:15:02 2021-12-17
## 5 4 4 d TRUE even even 2021-12-11 01:15:00 2021-12-15
## 6 2 2 b TRUE even even 2021-12-11 01:14:58 2021-12-13
#
stopifnot(identical(bb_2, arrange(bb, ll, -dd)))
“Understanding basic data manipulation using R”
Links (Ref)
# #To get the Help files on any Topic including 'loaded' Packages
?dplyr
?mutate
# #To get the Help files on any Topic including functions from 'not loaded' Packages
?dplyr::mutate()
# #Operators need Backticks i.e. ` . In keyboards it is located below 'Esc' Key
?`:`
Overview
A logical value is TRUE, FALSE, or NA.
# #At least one TRUE is present
NA | TRUE
## [1] TRUE
# #Depending upon what the unknown is, the outcome will change
NA | FALSE
## [1] NA
# #Depending upon what the unknown is, the outcome will change
NA & TRUE
## [1] NA
# #At least one FALSE is present
NA & FALSE
## [1] FALSE
#
# #For length-1 vectors, the output of the vectorised and non-vectorised forms is the same
stopifnot(all(identical(NA || TRUE, NA | TRUE), identical(NA || FALSE, NA | FALSE),
identical(NA && TRUE, NA & TRUE), identical(NA && FALSE, NA & FALSE)))
#
# #But for vectors of >1 length, output is different
x <- 1:5
y <- 5:1
(x > 2) & (y < 3)
## [1] FALSE FALSE FALSE TRUE TRUE
(x > 2) && (y < 3)
## [1] FALSE
#
# # '&&' evaluates only the first element of a vector, so caution is advised
TRUE & c(TRUE, FALSE)
## [1] TRUE FALSE
TRUE & c(FALSE, FALSE)
## [1] FALSE FALSE
TRUE && c(TRUE, FALSE)
## [1] TRUE
TRUE && c(FALSE, FALSE)
## [1] FALSE
TRUE && all(c(TRUE, FALSE))
## [1] FALSE
TRUE && any(c(TRUE, FALSE))
## [1] TRUE
if(exists("x")) rm(x)
exists("x")
## [1] FALSE
#
# # No short-circuit for "|" or "&", Evaluates Right and throws Error
tryCatch( TRUE | x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
tryCatch( FALSE & x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
#
# #Doesn't evaluate Right input because outcome already determined
tryCatch( TRUE || x, error = function(e) cat(paste0(e)))
## [1] TRUE
tryCatch( FALSE && x, error = function(e) cat(paste0(e)))
## [1] FALSE
# #evaluates Right input because outcome can't be determined and throws error
tryCatch( TRUE && x, error = function(e) cat(paste0(e)))
## Error in doTryCatch(return(expr), name, parentenv, handler): object 'x' not found
# #any()
any(NA, TRUE)
## [1] TRUE
any(NA, FALSE)
## [1] NA
any(NA, TRUE, na.rm = TRUE)
## [1] TRUE
any(NA, FALSE, na.rm = TRUE)
## [1] FALSE
any(character(0))
## [1] FALSE
#
# #all()
all(NA, TRUE)
## [1] NA
all(NA, FALSE)
## [1] FALSE
all(NA, TRUE, na.rm = TRUE)
## [1] TRUE
all(NA, FALSE, na.rm = TRUE)
## [1] FALSE
all(character(0))
## [1] TRUE
\(>\) , \(<\) , \(==\) , \(>=\) , \(<=\) , \(!=\)
# #dplyr::filter() - Filter Rows based on Multiple Columns
bb_1 <- filter(bb, month == 1, day == 1)
dim(bb_1)
## [1] 842 19
# #Filtering by multiple criteria within a single logical expression
stopifnot(identical(bb_1, filter(bb, month == 1 & day == 1)))
#
if(anyNA(bb_1)) {
bb_na <- na.omit(bb_1)
print(paste0("Note: ", length(attributes(bb_na)$na.action), " rows removed."))
} else {
print("NA not found")
}
## [1] "Note: 11 rows removed."
dim(bb_na)
## [1] 831 19
dim(bb)
## [1] 336776 19
#
# #Flights in either of the months November or December
dim(bb_2 <- filter(bb, month == 11 | month == 12))
## [1] 55403 19
#
# #Flights with arrival delay '<= 120' or departure delay '<= 120'
# #It excludes flights where arrival & departure BOTH are delayed by >2 hours
# #If either delay is less than 2 hours, the flight is included
dim(bb_3 <- filter(bb, arr_delay <= 120 | dep_delay <= 120))
## [1] 320060 19
dim(bb_4 <- filter(bb, !(arr_delay > 120 & dep_delay > 120)))
## [1] 320060 19
dim(bb_5 <- filter(bb, (!arr_delay > 120 | !dep_delay > 120)))
## [1] 320060 19
#
# #Destination to IAH or HOU
dim(bb_6 <- filter(bb, dest == "IAH" | dest == "HOU"))
## [1] 9313 19
dim(bb_7 <- filter(bb, dest %in% c("IAH", "HOU")))
## [1] 9313 19
#
# #Carrier being "UA", "US", "DL"
dim(bb_8 <- filter(bb, carrier == "UA" | carrier == "US" | carrier == "DL"))
## [1] 127311 19
dim(bb_9 <- filter(bb, carrier %in% c("UA", "US", "DL")))
## [1] 127311 19
#
# #Didn't leave late (before/on-time departure) but arrived late by >2 hours
dim(bb_10 <- filter(bb, (arr_delay > 120) & !(dep_delay > 0)))
## [1] 29 19
#
# #Departed between midnight and 6 AM (inclusive)
dim(bb_11 <- filter(bb, (sched_dep_time >= 00 & sched_dep_time <= 600)))
## [1] 8970 19
# #subset() - Recommendation is against its usage. Use either '[]' or filter()
dim(bb_12 <- subset(bb, month == 1 | !(dep_delay >= 120),
select = c("flight", "arr_delay")))
## [1] 319760 2
dim(bb_13 <- subset(bb, month == 1 | !(dep_delay >= 120) | carrier == "DL",
select = c("flight", "arr_delay")))
## [1] 321139 2
\([ \ \ ]\) , \([[ \ \ ]]\) , \(\$\)
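One practical gotcha worth noting here (a minimal sketch with a toy data frame, not from the dataset): base '[' subsetting with a logical condition that contains NA keeps an all-NA row, while which() and dplyr::filter() silently drop such rows.

```r
# #Toy data (illustrative): NA handling differs between '[' and filter()
library(dplyr)
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
#
# #Base '[': an NA in the condition produces a row of NAs
nrow(df[df$x > 1, ])           # 2 rows: the x = 3 row plus one all-NA row
#
# #which() and filter() drop the NA cases
nrow(df[which(df$x > 1), ])    # 1 row
nrow(filter(df, x > 1))        # 1 row
```

So row counts can differ between the two styles whenever the filtering column contains NA, as dep_delay and arr_delay do here.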
dplyr::select()
dim(bb)
## [1] 336776 19
#
# #Subset Consecutive Columns using Colon
stopifnot(identical(bb[ , 2:5], bb[ , -c(1, 6:ncol(bb))]))
#
# #dplyr::select()
bb_14 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_15 <- bb %>% select(year:day, arr_delay, dep_delay, distance, air_time)
stopifnot(identical(bb_14, bb_15))
dim(bb)
## [1] 336776 19
#
# #Subset Rows
dim(bb[which(bb$day == 1 & !(bb$month ==1)), ])
## [1] 10194 19
dim(bb[which(bb$day == 1 | bb$month ==1), ])
## [1] 37198 19
dim(bb[which(bb$day == 1 & bb$month ==1), ])
## [1] 842 19
# #CAUTION: a comma inside which() is NOT 'AND'; the 2nd argument is 'arr.ind', so the 2nd condition is effectively ignored (here and below)
dim(bb[which(bb$day == 1, bb$month ==1), ])
## [1] 11036 19
dim(bb[which(bb$day == 1 & !(bb$carrier == "DL")), ])
## [1] 9482 19
dim(bb[which(bb$day == 1 | bb$carrier == "DL"), ])
## [1] 57592 19
dim(bb[which(bb$day == 1 & bb$carrier == "DL"), ])
## [1] 1554 19
dim(bb[which(bb$day == 1, bb$carrier == "DL"), ])
## [1] 11036 19
bb <- xxflights
# #dplyr::summarise() & dplyr::summarize() are same
# #Get the mean of a column with NA excluded
#
summarize(bb, delay_mean = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
## delay_mean
## <dbl>
## 1 12.6
#
# #base::summary()
summary(bb$dep_delay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43.00 -5.00 -2.00 12.64 11.00 1301.00 8255
#
# #Grouped Summary
by_ymd <- group_by(bb, year, month, day)
mysum <- summarize(by_ymd,
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups = "keep")
# #Equivalent
bb %>%
group_by(year, month, day) %>%
summarize(dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups= "keep")
## # A tibble: 365 x 5
## # Groups: year, month, day [365]
## year month day dep_delay_mean arr_delay_mean
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 11.5 12.7
## 2 2013 1 2 13.9 12.7
## 3 2013 1 3 11.0 5.73
## 4 2013 1 4 8.95 -1.93
## 5 2013 1 5 5.73 -1.53
## 6 2013 1 6 7.15 4.24
## 7 2013 1 7 5.42 -4.95
## 8 2013 1 8 2.55 -3.23
## 9 2013 1 9 2.28 -0.264
## 10 2013 1 10 2.84 -5.90
## # ... with 355 more rows
# #Get delay grouped by distance 'Distance between airports, in miles.'
summary(bb$distance)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17 502 872 1040 1389 4983
#
# #How many unique values are present in this numeric data (i.e. treat it as a factor)
str(as.factor(bb$distance))
## Factor w/ 214 levels "17","80","94",..: 163 165 145 171 106 96 138 22 120 99 ...
str(sort(unique(bb$distance)))
## num [1:214] 17 80 94 96 116 143 160 169 173 184 ...
bb %>%
group_by(distance) %>%
summarize(count = n(),
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE),
.groups= "keep")
## # A tibble: 214 x 4
## # Groups: distance [214]
## distance count dep_delay_mean arr_delay_mean
## <dbl> <int> <dbl> <dbl>
## 1 17 1 NaN NaN
## 2 80 49 18.9 16.5
## 3 94 976 17.5 12.7
## 4 96 607 3.19 5.78
## 5 116 443 17.7 7.05
## 6 143 439 23.6 14.4
## 7 160 376 21.8 16.2
## 8 169 545 18.5 15.1
## 9 173 221 7.05 -0.286
## 10 184 5504 3.07 0.123
## # ... with 204 more rows
#
# #For distance =17, there is only 1 flight and that too has NA, so the mean is NaN
bb[bb$distance == 17, ]
## # A tibble: 1 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013 7 27 NA 106 NA NA 245 NA US
## # ... with 9 more variables: flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
#
# #In general, flights to a given destination (e.g. ABQ) have travelled the same distance (1826)
unique(bb %>% filter(dest == "ABQ") %>% select(distance))
## # A tibble: 1 x 1
## distance
## <dbl>
## 1 1826
#
# #Mean Delays for Destinations with more than 1000 miles distance
bb %>%
group_by(dest) %>%
filter(distance > 1000) %>%
summarize(count = n(),
distance_mean = mean(distance, na.rm = TRUE),
dep_delay_mean = mean(dep_delay, na.rm = TRUE),
arr_delay_mean = mean(arr_delay, na.rm = TRUE))
## # A tibble: 48 x 5
## dest count distance_mean dep_delay_mean arr_delay_mean
## <chr> <int> <dbl> <dbl> <dbl>
## 1 ABQ 254 1826 13.7 4.38
## 2 ANC 8 3370 12.9 -2.5
## 3 AUS 2439 1514. 13.0 6.02
## 4 BQN 896 1579. 12.4 8.25
## 5 BUR 371 2465 13.5 8.18
## 6 BZN 36 1882 11.5 7.6
## 7 DEN 7266 1615. 15.2 8.61
## 8 DFW 8738 1383. 8.68 0.322
## 9 DSM 569 1021. 26.2 19.0
## 10 EGE 213 1736. 15.5 6.30
## # ... with 38 more rows
dim(bb)
## [1] 336776 19
#
bb_16 <- select(bb, year:day, arr_delay, dep_delay, distance, air_time)
bb_17 <- mutate(bb_16,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
# #Equivalent
bb %>%
select(year:day, arr_delay, dep_delay, distance, air_time) %>%
mutate(gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
## year month day arr_delay dep_delay distance air_time gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 11 2 1400 227 9 370. 3.78 2.38
## 2 2013 1 1 20 4 1416 227 16 374. 3.78 4.23
## 3 2013 1 1 33 2 1089 160 31 408. 2.67 11.6
## 4 2013 1 1 -18 -1 1576 183 -17 517. 3.05 -5.57
## 5 2013 1 1 -25 -6 762 116 -19 394. 1.93 -9.83
## 6 2013 1 1 12 -4 719 150 16 288. 2.5 6.4
## 7 2013 1 1 19 -5 1065 158 24 404. 2.63 9.11
## 8 2013 1 1 -14 -3 229 53 -11 259. 0.883 -12.5
## 9 2013 1 1 -8 -3 944 140 -5 405. 2.33 -2.14
## 10 2013 1 1 8 -2 733 138 10 319. 2.3 4.35
## # ... with 336,766 more rows
Inferential statistics are used for Hypothesis Testing. Refer Statistical Inference
Refer Hypothesis Testing
Refer Steps of Hypothesis Testing
Question: Is there an ideal sample size?
Example: \({\overline{x}}\) is an estimator (of the population parameter ‘mean’ \({\mu}\)); a calculated value such as 3 is an estimate, and the calculation process is an estimation.
7.6 Given a data set \({X=\{x_1,x_2,\ldots,x_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({x_1,x_2,\ldots,x_n}\) divided by the count \({n}\).
Refer Standard Deviation and equation (7.12)
\[\begin{align} \sigma &= \sqrt{\frac{1}{N} \sum_{i=1}^N \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \bar{x}\right)^2} \end{align}\]
A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range.
Refer Variance and equation (7.11)
\[\begin{align} \sigma^2 &= \frac{1}{N} \sum _{i=1}^{N} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align}\]
Variability is most commonly measured with the Range, IQR, SD, and Variance.
The sample we draw from the population is only one from a large number of potential samples.
Refer Standard Error
11.14 A sampling error is the difference between a population parameter and a sample statistic.
Sampling fluctuation (Standard Error) refers to the extent to which a statistic (mean, median, mode, sd etc.) takes on different values with different samples i.e. it refers to how much the value of the statistic fluctuates from sample to sample.
11.12 The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).
Standard Deviation of \({\overline{x}}\), \(\sigma_{\overline{x}}\) is given by equation (11.1) i.e. \(\sigma_{\overline{x}} = \frac{\sigma}{\sqrt{n}}\)
Refer Test Statistic
13.11 Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.
For hypothesis tests about a population mean in the \({\sigma}\) known case, we use the standard normal random variable \({z}\) as a test statistic to determine whether \({\overline{x}}\) deviates from the hypothesized value of \({\mu}\) enough to justify rejecting the null hypothesis. As given in equation (13.1) i.e. \(z = \frac{\overline{x} - \mu_0}{\sigma_{\overline{x}}} = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}}\)
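The z statistic of equation (13.1) can be computed directly. The numbers below are made up for illustration (hypothesized mean 100, known σ = 15, sample size 36, observed sample mean 105):

```r
# #Illustrative numbers (hypothetical): H0: mu = 100, known sigma = 15
mu0 <- 100; sigma <- 15; n <- 36; xbar <- 105
#
# #Standard error of the mean, then z as in equation (13.1)
se <- sigma / sqrt(n)      # 2.5
z  <- (xbar - mu0) / se    # 2
#
# #Two-sided p-value from the standard normal distribution
p <- 2 * pnorm(-abs(z))
round(p, 4)                # 0.0455
```

With a significance level of 0.05, this z of 2 would (just) lead to rejecting the null hypothesis.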
Standard Error (SE) is the same as ‘the standard deviation of the sampling distribution.’ The ‘variance of the sampling distribution’ is the variance of the data divided by the sample size n.
# #DataSet: Height of 5 people in 'cm'
hh <- c(170.5, 161, 160, 170, 150.5)
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #Mean by mean()
hh_mean <- mean(hh)
cat("Mean = ", hh_mean)
## Mean = 162.4
#
# #Variance by var()
hh_var <- round(var(hh), 3)
cat("Variance = ", hh_var)
## Variance = 68.175
#
# #Standard Deviation (SD) by sd()
hh_sd <- round(sd(hh), 3)
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) = 8.257
#
# #Standard Error (SE)
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) = 3.693
# #DataSet: Height of 5 people in 'cm'
print(hh)
## [1] 170.5 161.0 160.0 170.0 150.5
#
# #N by length()
print(hh_len <- length(hh))
## [1] 5
#
# #sum by sum()
print(hh_sum <- sum(hh))
## [1] 812
#
# #Mean by mean()
hh_mean <- mean(hh)
hh_mean_cal <- hh_sum / hh_len
stopifnot(identical(hh_mean, hh_mean_cal))
cat("Mean = ", hh_mean)
## Mean = 162.4
#
# #Calculate the deviation from the mean by subtracting the mean from each value
print(hh_dev <- hh - hh_mean)
## [1] 8.1 -1.4 -2.4 7.6 -11.9
#
# #Square the deviation
print(hh_sqdev <- hh_dev^2)
## [1] 65.61 1.96 5.76 57.76 141.61
#
# #Get Sum of the squared deviations
print(hh_sqdev_sum <- sum(hh_sqdev))
## [1] 272.7
#
# #Divide it by the 'sample size (N) – 1' for the Variance or use var()
hh_var <- round(var(hh), 3)
hh_var_cal <- hh_sqdev_sum / (hh_len -1)
stopifnot(identical(hh_var, hh_var_cal))
cat("Variance = ", hh_var)
## Variance = 68.175
#
# #Variance of the sampling distribution
hh_var_sample <- hh_var / hh_len
cat("Variance of the Sampling Distribution = ", hh_var_sample)
## Variance of the Sampling Distribution = 13.635
#
# #Take square root of the Variance for the Standard Deviation (SD) or use sd()
hh_sd_cal <- round(sqrt(hh_var), 3)
hh_sd <- sd(hh)
stopifnot(identical(round(hh_sd, 3), hh_sd_cal))
cat("Standard Deviation (SD) = ", hh_sd)
## Standard Deviation (SD) = 8.256815
#
# #Standard Error (SE)
# #SE
# #Divide the SD by the square root of the sample size for the Standard Error (SE)
# #
hh_se_sd <- round(hh_sd / sqrt(hh_len), 3)
#
# #Calculate SE from Variance
hh_se_var <- round(sqrt(hh_var_sample), 3)
stopifnot(identical(hh_se_sd, hh_se_var))
cat("Standard Error (SE) = ", hh_se_sd)
## Standard Error (SE) = 3.693
Using Dataset Flights : “air_time” - Amount of time spent in the air, in minutes. Refer figure 4.1
Figure 4.1 Flights: Air Time (min) excluding NA (Histogram and Density)
# #Remove All NA
aa <- na.omit(xxflights$air_time)
attr(aa, "na.action") <- NULL
str(aa)
## num [1:327346] 227 227 160 183 116 150 158 53 140 138 ...
summary(aa)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 82.0 129.0 150.7 192.0 695.0
# #Overview of Data after removal of NA
bb <- aa
stopifnot(is.null(dim(bb)))
summary(bb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 82.0 129.0 150.7 192.0 695.0
# #min(), max(), range(), summary()
min_bb <- summary(bb)[1]
max_bb <- summary(bb)[6]
range_bb <- max_bb - min_bb
cat(paste0("Range = ", range_bb, " (", min_bb, ", ", max_bb, ")\n"))
## Range = 675 (20, 695)
# #IQR(), summary()
iqr_bb <- IQR(bb)
cat(paste0("IQR = ", iqr_bb, " (", summary(bb)[2], ", ", summary(bb)[5], ")\n"))
## IQR = 110 (82, 192)
# #median(), mean(), summary()[3], summary()[4]
median_bb <- median(bb)
cat("Median =", median_bb, "\n")
## Median = 129
mu_mean_bb <- mean(bb)
cat("Mean \u03bc =", mu_mean_bb, "\n")
## Mean µ = 150.6865
#
sigma_sd_bb <- sd(bb)
cat("SD (sigma) \u03c3 =", sigma_sd_bb, "\n")
## SD (sigma) σ = 93.6883
#
variance_bb <- var(bb)
cat(sprintf('Variance (sigma)%s %s%s =', '\u00b2', '\u03c3', '\u00b2'), variance_bb, "\n")
## Variance (sigma)² σ² = 8777.498
# #Histogram
bb <- na.omit(xxflights$air_time)
hh <- tibble(ee = bb)
# #Basics
median_hh <- round(median(hh[[1]]), 3)
mean_hh <- round(mean(hh[[1]]), 3)
sd_hh <- round(sd(hh[[1]]), 3)
len_hh <- nrow(hh)
#
B12P01 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean_hh), color = '#440154FF') +
geom_text(aes(label = paste0("Mean= ", round(mean_hh, 1)), x = mean_hh, y = -Inf),
color = '#440154FF', hjust = -0.5, vjust = 1.3, angle = 90, check_overlap = TRUE) +
geom_vline(aes(xintercept = median_hh), color = '#3B528BFF') +
geom_text(aes(label = paste0("Median= ", round(median_hh, 1)), x = median_hh, y = -Inf),
color = '#3B528BFF', hjust = -0.5, vjust = -0.7, angle = 90, check_overlap = TRUE) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1),
"; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
")"),
caption = "B12P01", title = "Flights: Air Time")
}
# #Density Curve
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]],.05)
q95_hh <- quantile(hh[[1]],.95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
B12P02 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
#scale_y_continuous(limits = c(0, 0.009), breaks = seq(0, 0.009, 0.003)) +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
ggplot2::annotate("segment", x = xpoints_hh[4] - 0.5 * sd_hh, xend = xpoints_hh[2], y = 0.007,
yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] + 0.5 * sd_hh, xend = xpoints_hh[6], y = 0.007,
yend = 0.007, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.007, label = "95.4%") +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Density",
subtitle = paste0("(N=", nrow(.), "; ", "Mean= ", round(mean(.[[1]]), 1),
"; Median= ", round(median(.[[1]]), 1), "; SD= ", round(sd(.[[1]]), 1),
")"),
caption = "B12P02", title = "Flights: Air Time")
}
Using Dataset Flights : “air_time” - Amount of time spent in the air, in minutes.
Caution: Trend here does not match with the theory. However, the exercise shows the ‘How to do it’ part. It can be repeated with better data, larger sample size, or repeat sampling.
Figure 4.2 Effect of Increasing Sample Size
Figure 4.3 Effect of Increasing Sample Size
bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed()
set.seed(3)
# #Set Sample Size
#nn <- 100L
# #Take a sample from dataset
xb100 <- sample(bb, size = 100L)
xb1000 <- sample(bb, size = 1000L)
xb10000 <- sample(bb, size = 10000L)
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 100
hh <- tibble(ee = xb100)
ylim_hh <- 12.5
caption_hh <- "B12P03"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, caption_hh, ylim_hh
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = 2, vjust = -2.5, parse = TRUE, check_overlap = TRUE) +
geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) +
coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(Mean= ", round(mean(.[[1]]), 1),
"; SD= ", round(sd(.[[1]]), 1),
#"; Var= ", round(var(.[[1]]), 1),
"; SE= ", round(sd(.[[1]]) / sqrt(nrow(.)), 1),
")"),
caption = caption_hh, title = paste0("Sample Size = ", nrow(.)))
}
assign(caption_hh, B12)
rm(B12)
Figure 4.4 Effect of Increasing Sample Size
Figure 4.5 Effect of Increasing Sampling
bb <- na.omit(xxflights$air_time)
# #Pseudo Random Number Generation by set.seed()
set.seed(3)
# #Set Sample Size
nn <- 10L
# #Set Repeat Sampling Rate
rr <- 20L
# #Take Sample of N = 10, get mean, repeat i.e. get distribution of mean
xr20 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 200L
xr200 <- replicate(rr, mean(sample(bb, size = nn)))
rr <- 2000L
xr2000 <- replicate(rr, mean(sample(bb, size = nn)))
#
# #Population Mean
mu_hh <- round(mean(bb), 1)
# #Histogram: N = 10, Repeat = 20
hh <- tibble(ee = xr20)
ylim_hh <- 2
caption_hh <- "B12P06"
# #Assumes 'hh' has data in 'ee'. In: mu_hh, caption_hh, ylim_hh, nn
#
B12 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
geom_text(aes(label = TeX(r'($E(\bar{x})$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = 1.5, vjust = -1.5, parse = TRUE, check_overlap = TRUE) +
geom_vline(aes(xintercept = mu_hh), color = '#3B528BFF') +
geom_text(aes(label = TeX(r'($\mu$)', output = "character"), x = mu_hh, y = -Inf),
color = '#3B528BFF', hjust = -1, vjust = -2, parse = TRUE, check_overlap = TRUE) +
coord_cartesian(xlim = c(0, 800), ylim = c(0, ylim_hh)) +
theme(plot.title.position = "panel") +
labs(x = TeX(r'($\bar{x} \, (\neq x)$)'), y = TeX(r'(Frequency of $\, \bar{x}$)'),
subtitle = TeX(sprintf(
"($\\mu$=%.0f) $E(\\bar{x}) \\, =$%.0f; $\\sigma_{\\bar{x}} \\, =$%.0f",
mu_hh, round(mean(.[[1]]), 1), round(sd(.[[1]])))),
caption = caption_hh,
title = paste0("Sampling Distribution (N = ", nn, ") & Repeat Sampling = ", nrow(.)))
}
assign(caption_hh, B12)
rm(B12)
Refer Normal Distribution and equation (10.2)
10.3 A normal distribution (\({\mathcal {N}}_{(\mu,\, \sigma^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
Their importance is partly due to the Central Limit Theorem. The assumption of a normal distribution allows us to apply parametric methods.
22.1 Parametric methods are the statistical methods that begin with an assumption about the probability distribution of the population which is often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).
11.15 Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
It states that, under some conditions, the average of many samples (observations) of a random variable with finite mean and variance is itself a random variable—whose distribution converges to a normal distribution as the number of samples increases.
Parametric statistical tests typically assume that samples come from normally distributed populations, but the central limit theorem means that this assumption is not necessary to meet when you have a large enough sample. A sample size of 30 or more is generally considered large.
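This convergence can be checked with a quick simulation (a minimal sketch in base R; the exponential population and all constants are illustrative):

```r
# #CLT sketch: sample means from a skewed (exponential) population
set.seed(1)
nn <- 30L                                # sample size
xbar <- replicate(5000, mean(rexp(nn)))  # exponential: mu = 1, sigma = 1
# #The sampling distribution centres on mu with SE close to sigma / sqrt(n)
round(c(Mean = mean(xbar), SE = sd(xbar), TheorySE = 1 / sqrt(nn)), 3)
```

Even though the population is strongly skewed, a histogram of `xbar` is approximately bell-shaped at n = 30.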
This is the basis of the Empirical Rule.
7.20 The Empirical Rule is used to compute the percentage of data values that lie within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These percentages are approximately 68.27%, 95.45%, and 99.73%, respectively.
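These percentages follow directly from the standard normal CDF and can be reproduced with pnorm():

```r
# #Empirical Rule: percentage within k standard deviations of the mean
k <- 1:3
round(100 * (pnorm(k) - pnorm(-k)), 2)
## [1] 68.27 95.45 99.73
```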
Caution: If data from small samples do not closely follow this pattern, then other distributions like the t-distribution may be more appropriate.
Refer Standard Normal and equation (10.3)
10.4 A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal {z}}_{({\mu} = 0,\, {\sigma} = 1)}\)
The simplest case of a normal distribution is known as the standard normal distribution. Given a population with normal distribution \({\mathcal {N}}_{(\mu,\, \sigma)}\),
if \(\overline {X}\) is the mean of a sample of size \({n}\) from this population, then the standard error is \(\sigma/{\sqrt{n}}\) and thus the z-score is \(Z=\frac {\overline {X}-\mu }{\sigma/{\sqrt{n}}}\)
The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.
For an upper-tail test, the area under the curve to the right of the z-score is the p-value: the likelihood of observing a result at least as extreme as yours if the null hypothesis is true.
Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
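As a worked sketch with made-up numbers (sample mean 510 against a hypothesised mean of 500, known \(\sigma = 40\), \(n = 64\)):

```r
# #One-sample upper-tail z-test with illustrative (made-up) numbers
xbar <- 510; mu0 <- 500; sigma <- 40; n <- 64
z <- (xbar - mu0) / (sigma / sqrt(n))    # (510 - 500) / 5 = 2
p_value <- pnorm(z, lower.tail = FALSE)  # area to the right of z
round(c(z = z, p = p_value), 4)
##      z      p
## 2.0000 0.0228
```

Since the p-value 0.0228 is below 0.05, the null hypothesis would be rejected at the 5% level in this hypothetical example.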
Refer Outliers
7.21 Sometimes unusually large or unusually small values are called outliers. It is a data point that differs significantly from other observations.
Figure 4.6 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
Example
Since we are using sample data to make inferences about the population, it is possible that we will make an error. In the case of the Null Hypothesis, we can make one of two errors.
Refer Type I and Type II Errors
13.7 The error of rejecting \({H_0}\) when it is true, is Type I error.
13.8 The error of accepting \({H_0}\) when it is false, is Type II error.
13.9 The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
12.3 The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)), e.g. 0.95 is the confidence coefficient for a 95% confidence level.
13.22 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
There is always a tradeoff between Type-I and Type-II errors.
In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.
13.10 Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).”
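For a specific alternative value of \(\mu\), \(\beta\) and the power can be computed directly; a sketch for an upper-tail z-test, again with illustrative numbers:

```r
# #Type II error (beta) and power for an upper-tail z-test, illustrative numbers
mu0 <- 500; mu_a <- 510                  # H0 value and one specific true mean
sigma <- 40; n <- 64; alpha <- 0.05
se <- sigma / sqrt(n)
crit <- mu0 + qnorm(alpha, lower.tail = FALSE) * se  # reject H0 if xbar > crit
beta <- pnorm(crit, mean = mu_a, sd = se)            # P(do not reject | mu = mu_a)
round(c(beta = beta, power = 1 - beta), 3)
```

Moving `mu_a` further from `mu0`, or increasing `n`, lowers `beta` and raises the power.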
Figure 4.7 Left Tail vs. Right Tail
Figure 4.8 Two Tail
13.17 Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.2816 1.6449 1.9600 2.3263 2.5758
13.15 A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).
13.18 An acceptance region (confidence interval) is the set of values of the test statistic for which the null hypothesis is accepted, i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.
13.20 A rejection region (critical region) is the set of values of the test statistic for which the null hypothesis is rejected, i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.
13.12 A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.
One-tailed tests are concerned with one side of the distribution, whereas two-tailed tests deal with both tails.
A two-tailed test is used when the direction of the effect is unknown, so both sides are tested.
13.4 \(\text{\{Left Tail or Lower Tail\} } H_0 : \mu \geq \mu_0 \iff H_a: \mu < \mu_0\)
13.5 \(\text{\{Right Tail or Upper Tail\} } H_0 : \mu \leq \mu_0 \iff H_a: \mu > \mu_0\)
13.6 \(\text{\{Two Tail\} } H_0 : \mu = \mu_0 \iff H_a: \mu \neq \mu_0\)
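Given an observed test statistic z, the p-value for each formulation follows from pnorm() (a minimal sketch):

```r
# #p-value for an observed test statistic z, by tail type
z <- 1.96
round(c(Left  = pnorm(z),                      # H_a: mu < mu_0
        Right = pnorm(z, lower.tail = FALSE),  # H_a: mu > mu_0
        Two   = 2 * pnorm(-abs(z))), 4)        # H_a: mu != mu_0
##   Left  Right    Two
## 0.9750 0.0250 0.0500
```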
13.14 The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
Steps for the p-value approach or test statistic approach
13.16 The critical value approach requires that we first determine a value for the test statistic called the critical value.
Steps for the critical value approach
If the population standard deviation \(\sigma\) is known, apply the z-test; if it is unknown and must be estimated from the sample, apply the t-test. The t-test converges to the z-test as the sample size increases.
Question: Does the probability from the t-table differ from the probability value from the z-table?
It is assumed that \((\overline{x} - \mu)\) is normally distributed. However, the estimated Standard Error (SE) is not: the sample variance generally follows a (scaled) chi-squared distribution. Thus, \((\overline{x} - \mu)/SE\) becomes ‘Normal/ChiSq’, and this ratio follows the t-distribution. Hence, the test we apply is called the t-test.
# #For Degrees of Freedom = 10 (N=11)
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
dof <- 10L
#
# #Left Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.3722 -1.8125 -2.2281 -2.7638 -3.1693
#
# #Right Tail Test
round(qt(p = xxalpha, df = dof, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.3722 1.8125 2.2281 2.7638 3.1693
12.5 The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).
Why are the degrees of freedom \((n-1)\)?
Question: Is there any minimum sample size we must consider before calculating degrees of freedom?
Guess: Degrees of freedom are also calculated to remove possible bias.
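The convergence of the t-test to the z-test can be seen by letting the degrees of freedom grow (the df values here are illustrative):

```r
# #t critical values approach the z critical value as df increases
round(qt(p = 0.975, df = c(10, 30, 100, 1000)), 4)
## [1] 2.2281 2.0423 1.9840 1.9623
round(qnorm(p = 0.975), 4)
## [1] 1.96
```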
Definitions and Exercises are from the Book (David R. Anderson 2018)
Hence, the total number of data items can be determined by multiplying the number of observations by the number of variables.
For example, Gender as Male and Female. In cases where the scale of measurement is nominal, a numerical code as well as a nonnumerical label may be used, e.g. 1 denotes Male and 2 denotes Female. The scale of measurement is nominal even though the data appear as numerical values. Only the mode can be calculated.
For example, Size as small, medium, or large. Along with the labels, as with nominal data, the data can also be ranked or ordered, which makes the measurement scale ordinal. Ordinal data can also be recorded by a numerical code. The median can be calculated but not the mean.
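For nominal data, only counting is meaningful; a sketch with made-up labels:

```r
# #Nominal data: frequencies and the mode (illustrative labels)
gender <- c("Male", "Female", "Female", "Male", "Female")
table(gender)
names(which.max(table(gender)))  # modal category
## [1] "Female"
```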
Interval data are always numerical. These can be ranked or ordered like ordinal. In addition, the differences between them are meaningful.
Variables such as distance, height, weight, and time use the ratio scale of measurement. This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point. Mean can be calculated.
For example, consider the cost of an automobile. A zero value for the cost would indicate that the automobile has no cost and is free. In addition, if we compare the cost of $30,000 for one automobile to the cost of $15,000 for a second automobile, the ratio property shows that the first automobile is $30,000/$15,000 = 2 times, or twice, the cost of the second automobile.
See Table 5.1 for more details.
Interval scale is a measure of continuous quantitative data that has an arbitrary 0 reference point. This is contrasted with ratio-scaled data, which have a non-arbitrary 0 reference point. Ex: When we look at “profit”, we see that negative profit makes sense to us. So, while the 0 for “profit” conveys information (just like 0 on the Celsius temperature scale), it is an arbitrary reference point. Therefore, profit is on the interval scale of measurement.
In an interval scale, you can take difference of two values. You may not be able to take ratios of two values. Ex: Temperature in Celsius. You can say that if temperatures of two places are 40 °C and 20 °C, then one is hotter than the other (taking difference). But you cannot say that first is twice as hot as the second (not allowed to take ratio).
In a ratio scale, you can take a ratio of two values. Ex: 40 kg is twice as heavy as 20 kg (taking ratios).
Also, “0” on ratio scale means the absence of that physical quantity. “0” on interval scale does not mean the same. 0 kg means the absence of weight. 0 °C does not mean absence of heat.
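The difference shows up immediately in arithmetic (the temperatures are illustrative):

```r
# #Ratios mislead on an interval scale (Celsius) but not on a ratio scale (Kelvin)
c_hot <- 40; c_cold <- 20
c_hot / c_cold                        # 2, an artifact of the arbitrary zero
(c_hot + 273.15) / (c_cold + 273.15)  # ~1.07 on the absolute Kelvin scale
```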
| Features | Interval scale | Ratio scale |
|---|---|---|
| Variable property | Addition and subtraction | Multiplication and Division i.e. calculate ratios. Thus, you can leverage numbers on the scale against 0. |
| Absolute Point Zero | Zero-point in an interval scale is arbitrary. For example, the temperature can be below 0 °C and into negative temperatures. | The ratio scale has an absolute zero or character of origin. Height and weight cannot be zero or below zero. |
| Calculation | Statistically, in an interval scale, the Arithmetic Mean is calculated. Statistical dispersion permits range and standard deviation. The coefficient of variation is not permitted. | Statistically, in a ratio scale, the Geometric or Harmonic mean is calculated. Also, range and coefficient of variation are permitted for measuring statistical dispersion. |
| Measurement | Interval scale can measure size and magnitude as multiple factors of a defined unit. | Ratio scale can measure size and magnitude as a factor of one defined unit in terms of another. |
| Example | Temperature in Celsius, Calendar years and time, Profit | These possess an absolute zero characteristic, like age, weight, height, or Sales |
If the variable is categorical, the statistical analysis is limited. We can summarize categorical data by counting the number of observations in each category or by computing the proportion of the observations in each category. However, even when the categorical data are identified by a numerical code, arithmetic operations do not provide meaningful results.
Arithmetic operations provide meaningful results for quantitative variables. For example, quantitative data may be added and then divided by the number of observations to compute the average value.
Quantitative data may be discrete or continuous.
As a result, the data obtained from a well-designed experiment can often provide more information as compared to the data obtained from existing sources or by conducting an observational study.
Refer Sample For More …
The population is the set of entities under study.
Instead, we could take a subset of this population called a sample and use this sample to draw inferences about the population under study, given some conditions.
Whenever statisticians use a sample to estimate a population characteristic of interest, they usually provide a statement of the quality, or precision, associated with the estimate.
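One common precision statement is a margin of error around the sample mean; a minimal sketch on simulated data (all constants are illustrative):

```r
# #95% margin of error for a sample mean (large-sample z approximation)
set.seed(7)
x <- rnorm(100, mean = 50, sd = 10)           # illustrative sample
me <- qnorm(0.975) * sd(x) / sqrt(length(x))  # margin of error
round(c(Estimate = mean(x), Lower = mean(x) - me, Upper = mean(x) + me), 2)
```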
Inferential statistics are used for Hypothesis Testing.
The two most common types of Statistical Inference are -
Reasoning for Tests of Significance
Analytics is used for data-driven or fact-based decision making, which is often seen as more objective than alternative approaches to decision making. The tools of analytics can aid decision making by creating insights from data, improving our ability to more accurately forecast for planning, helping us quantify risk, and yielding better alternatives through analysis.
Analytics is now generally thought to comprise three broad categories of techniques. These categories are descriptive analytics, predictive analytics, and prescriptive analytics.
Examples of these types of techniques are data queries, reports, descriptive statistics, data visualization, data dash boards, and basic what-if spreadsheet models.
Linear regression, time series analysis, and forecasting models fall into the category of predictive analytics. Simulation, which is the use of probability and statistical computer models to better understand risk, also falls under the category of predictive analytics.
Prescriptive analytics differs greatly from descriptive or predictive analytics. What distinguishes prescriptive analytics is that prescriptive models yield a best course of action to take. That is, the output of a prescriptive model is a best decision.
Optimization models, which generate solutions that maximize or minimize some objective subject to a set of constraints, fall into the category of prescriptive models.
Volume refers to the amount of available data; velocity refers to the speed at which data is collected and processed; and variety refers to the different data types. The term data warehousing is used to refer to the process of capturing, storing, and maintaining the data.
Data mining relies heavily on statistical methodology such as multiple regression, logistic regression, and correlation.
cor(), lm() etc.
xxComputers <- f_getObject("xxComputers", "C01-Computers.csv", "971fb6096e4f71e8185d3327a9033a10")
xxCordless <- f_getObject("xxCordless", "C01-Cordless.csv", "9991f612fe44f1c890440bd238084679")
f_getObject <- function(x_name, x_source, x_md = "") {
# #Debugging
a07bug <- FALSE
# #Read the File or Object
# #Ex: xxCars <- f_getObject("xxCars", "S16-cars2.csv", "30051fb47f65810f33cb992015b849cc")
# #tools::md5sum("xx.csv") OR tools::md5sum(paste0(.z$XL, "xx", ".txt"))
#
# #Path to the File
loc_src <- paste0(.z$XL, x_source)
# #Path to the Object
loc_rds <- paste0(.z$XL, x_name, ".rds")
#
# #x_file[1] FILENAME & x_file[2] FILETYPE
x_file <- strsplit(x_source, "[.]")[[1]]
#
if(all(x_md == tools::md5sum(loc_src), file.exists(loc_rds),
file.info(loc_src)$mtime < file.info(loc_rds)$mtime)) {
# #Read RDS if (exists, newer than source, source not modified i.e. passes md5sum)
if(a07bug) print("A07 Flag 01: Reading from RDS")
return(readRDS(loc_rds))
} else if(!file.exists(loc_src)){
message("ERROR: File does not exist! : ", loc_src, "\n")
stop()
} else if(x_file[2] == "csv") {
# #Read CSV as a Tibble
# #col_double(), col_character(), col_logical(), col_integer()
# #DATETIME (EXCEL) "YYYY-MM-DD HH:MM:SS" imported as "UTC"
tbl <- read_csv(loc_src, show_col_types = FALSE)
# #Remove Unnecessary Attributes
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
# #Write Object as RDS
saveRDS(tbl, loc_rds)
# #Return Object
if(a07bug) print("A07 Flag 02: Reading from Source and Saving as RDS")
return(tbl)
} else if(x_file[2] == "xlsx") {
# #Read All Sheets of Excel in a list
tbl <- lapply(excel_sheets(loc_src), read_excel, path = loc_src)
# #Write Object as RDS
saveRDS(tbl, loc_rds)
# #Return Object
return(tbl)
} else {
message("f_getObject(): UNKNOWN")
stop()
}
}
bb <- xxComputers
#displ_names <- c("")
#stopifnot(identical(ncol(bb), length(displ_names)))
#
# #Kable Table
kbl(bb,
caption = "(C01T01)",
#col.names = displ_names,
escape = FALSE, align = "c", booktabs = TRUE
) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
html_font = "Consolas", font_size = 12,
full_width = FALSE,
#position = "float_left",
fixed_thead = TRUE
) %>%
# #Header Row Dark & Bold: RGB (48, 48, 48) =HEX (#303030)
row_spec(0, color = "white", background = "#303030", bold = TRUE,
extra_css = "border-bottom: 1px solid; border-top: 1px solid"
)
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"),
`2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
`2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
## # A tibble: 12 x 3
## Company Year Values
## <chr> <chr> <dbl>
## 1 Hertz 2007 327
## 2 Hertz 2008 311
## 3 Hertz 2009 286
## 4 Hertz 2010 290
## 5 Dollar 2007 167
## 6 Dollar 2008 140
## 7 Dollar 2009 106
## 8 Dollar 2010 108
## 9 Avis 2007 204
## 10 Avis 2008 220
## 11 Avis 2009 300
## 12 Avis 2010 270
# #Transpose
(ii <- bb %>%
pivot_longer(!Company, names_to = "Year", values_to = "Values") %>%
pivot_wider(names_from = Company, values_from = Values))
## # A tibble: 4 x 4
## Year Hertz Dollar Avis
## <chr> <dbl> <dbl> <dbl>
## 1 2007 327 167 204
## 2 2008 311 140 220
## 3 2009 286 106 300
## 4 2010 290 108 270
# #Equivalent
stopifnot(identical(ii,
bb %>% pivot_longer(-1) %>%
pivot_wider(names_from = 1, values_from = value) %>%
rename(., Year = name)))
| SN | tablet | cost | os | display_inch | battery_hh | cpu |
|---|---|---|---|---|---|---|
| 1 | Acer Iconia W510 | 599 | Windows | 10.1 | 8.5 | Intel |
| 2 | Amazon Kindle Fire HD | 299 | Android | 8.9 | 9.0 | TI OMAP |
| 3 | Apple iPad 4 | 499 | iOS | 9.7 | 11.0 | Apple |
| 4 | HP Envy X2 | 860 | Windows | 11.6 | 8.0 | Intel |
| 5 | Lenovo ThinkPad Tablet | 668 | Windows | 10.1 | 10.5 | Intel |
| 6 | Microsoft Surface Pro | 899 | Windows | 10.6 | 4.0 | Intel |
| 7 | Motorola Droid XYboard | 530 | Android | 10.1 | 9.0 | TI OMAP |
| 8 | Samsung Ativ Smart PC | 590 | Windows | 11.6 | 7.0 | Intel |
| 9 | Samsung Galaxy Tab | 525 | Android | 10.1 | 10.0 | Nvidia |
| 10 | Sony Tablet S | 360 | Android | 9.4 | 8.0 | Nvidia |
# #What is the average cost for the tablets #$582.90
cat(paste0("Avg. Cost for the tablets is = $", round(mean(bb$cost), digits = 1), "\n"))
## Avg. Cost for the tablets is = $582.9
#
# #Compare the average cost of tablets with different OS (Windows /Android) #$723.20 $428.5
(ii <- bb %>%
group_by(os) %>%
summarise(Mean = round(mean(cost), digits =1)) %>%
arrange(desc(Mean)) %>%
mutate(Mean = paste0("$", Mean)))
## # A tibble: 3 x 2
## os Mean
## <chr> <chr>
## 1 Windows $723.2
## 2 iOS $499
## 3 Android $428.5
#
cat(paste0("Avg. Cost of Tablets with Windows OS is = ",
ii %>% filter(os == "Windows") %>% select(Mean), "\n"))
## Avg. Cost of Tablets with Windows OS is = $723.2
# #What percentage of tablets use an Android operating system #40%
(ii <- bb %>%
group_by(os) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
arrange(desc(PCT)) %>%
mutate(PCT = paste0(PCT, "%")))
## # A tibble: 3 x 2
## os PCT
## <chr> <chr>
## 1 Windows 50%
## 2 Android 40%
## 3 iOS 10%
#
cat(paste0("Android OS is used in ",
ii %>% filter(os == "Android") %>% select(PCT), " Tablets\n"))
## Android OS is used in 40% Tablets
| SN | brand | model | price | overall_score | voice_quality | handset_on_base | talk_time_hh |
|---|---|---|---|---|---|---|---|
| 1 | AT&T | CL84100 | 60 | 73 | Excellent | Yes | 7 |
| 2 | AT&T | TL92271 | 80 | 70 | Very Good | No | 7 |
| 3 | Panasonic | 4773B | 100 | 78 | Very Good | Yes | 13 |
| 4 | Panasonic | 6592T | 70 | 72 | Very Good | No | 13 |
| 5 | Uniden | D2997 | 45 | 70 | Very Good | No | 10 |
| 6 | Uniden | D1788 | 80 | 73 | Very Good | Yes | 7 |
| 7 | Vtech | DS6521 | 60 | 72 | Excellent | No | 7 |
| 8 | Vtech | CS6649 | 50 | 72 | Very Good | Yes | 7 |
# #What is the average price for the cordless telephones
cat(paste0("Avg. Price is = $", round(mean(bb$price), digits = 1), "\n"))
## Avg. Price is = $68.1
#
# #What is the average talk time for the cordless telephones
cat(paste0("Avg. Talk Time is = ", round(mean(bb$talk_time_hh), digits = 1), " Hours \n"))
## Avg. Talk Time is = 8.9 Hours
# #What percentage of the cordless telephones have a voice quality of excellent
(hh <- bb %>%
group_by(voice_quality) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
mutate(voice_quality = factor(voice_quality,
levels = c("Very Good", "Excellent"), ordered = TRUE)) %>%
arrange(desc(voice_quality)) %>%
mutate(PCT = paste0(PCT, "%")))
## # A tibble: 2 x 2
## voice_quality PCT
## <ord> <chr>
## 1 Excellent 25%
## 2 Very Good 75%
#
cat(paste0("Percentage of 'Excellent' Voice Quality is = ",
hh %>% filter(voice_quality == "Excellent") %>% select(PCT), "\n"))
## Percentage of 'Excellent' Voice Quality is = 25%
#
# #Equivalent
print(bb %>%
group_by(voice_quality) %>%
summarise(PCT = n()) %>%
mutate(PCT = prop.table(PCT) * 100))
## # A tibble: 2 x 2
## voice_quality PCT
## <chr> <dbl>
## 1 Excellent 25
## 2 Very Good 75
# #What percentage of the cordless telephones have a handset on the base
bb %>%
group_by(handset_on_base) %>%
summarise(PCT = n()) %>%
mutate(PCT = 100 * PCT / sum(PCT)) %>%
arrange(desc(PCT)) %>%
mutate(PCT = paste0(PCT, "%")) %>%
filter(handset_on_base == "Yes")
## # A tibble: 1 x 2
## handset_on_base PCT
## <chr> <chr>
## 1 Yes 50%
bb <- tibble(Company = c("Hertz", "Dollar", "Avis"),
`2007` = c(327, 167, 204), `2008` = c(311, 140, 220),
`2009` = c(286, 106, 300), `2010` = c(290, 108, 270))
# #Transpose Tibble: Note that the First Column Header is lost after Transpose
# #Longer
hh <- bb %>% pivot_longer(!Company, names_to = "Year", values_to = "Values")
# #Transpose
ii <- bb %>%
pivot_longer(!Company, names_to = "Year", values_to = "Values") %>%
pivot_wider(names_from = Company, values_from = Values)
# #Save an Image
ggsave(paste0(.z$PX, "C01P01_Cars_TimeSeries", ".png"), plot = C01P01)
# #Load an Image
knitr::include_graphics(paste0(.z$PX, "C01P01_Cars_TimeSeries", ".png"))
Figure 5.1 Multiple Time Series Graph
# #who appears to be the market share leader
# #how the market shares are changing over time
print(ii)
## # A tibble: 4 x 4
## Year Hertz Dollar Avis
## <chr> <dbl> <dbl> <dbl>
## 1 2007 327 167 204
## 2 2008 311 140 220
## 3 2009 286 106 300
## 4 2010 290 108 270
# #Row Total
jj <- ii %>% rowwise() %>% mutate(SUM = sum(c_across(where(is.numeric)))) %>% ungroup()
kk <- ii %>% mutate(SUM = rowSums(across(where(is.numeric))))
stopifnot(identical(jj, kk))
#
# #Rowwise Percentage Share
ii %>%
rowwise() %>%
mutate(SUM = sum(c_across(where(is.numeric)))) %>%
ungroup() %>%
mutate(across(2:4, ~ round(. * 100 / SUM, digits = 1), .names = "{.col}.{.fn}")) %>%
mutate(across(ends_with(".1"), ~ paste0(., "%")))
## # A tibble: 4 x 8
## Year Hertz Dollar Avis SUM Hertz.1 Dollar.1 Avis.1
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 2007 327 167 204 698 46.8% 23.9% 29.2%
## 2 2008 311 140 220 671 46.3% 20.9% 32.8%
## 3 2009 286 106 300 692 41.3% 15.3% 43.4%
## 4 2010 290 108 270 668 43.4% 16.2% 40.4%
# #Bar Plot
aa <- bb %>%
select(Company, `2010`) %>%
rename("Y2010" = `2010`) %>%
arrange(desc(.[2])) %>%
mutate(cSUM = cumsum(Y2010)) %>%
mutate(PCT = 100 * Y2010 / sum(Y2010)) %>%
mutate(cPCT = 100 * cumsum(Y2010) / sum(Y2010)) %>%
mutate(across(Company, factor, levels = unique(Company), ordered = TRUE))
# #
pareto_chr <- setNames(c(aa$Y2010), aa$Company)
stopifnot(identical(pareto_chr, aa %>% pull(Y2010, Company)))
stopifnot(identical(pareto_chr, aa %>% select(1:2) %>% deframe()))
# #Plot Pareto
C01P02 <- pareto.chart(pareto_chr,
xlab = "Company", ylab = "Cars",
# colors of the chart
#col=heat.colors(length(pareto_chr)),
# ranges of the percentages at the right
cumperc = seq(0, 100, by = 20),
# label y right
ylab2 = "Cumulative Percentage",
# title of the chart
main = "Pareto Chart"
)
Figure 5.2 Pareto of Cars in 2010
The relative frequency of a class equals the fraction or proportion of observations belonging to a class i.e. it is out of 1 whereas ‘percent frequency’ is out of 100%.
Rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class.
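Both quantities follow from the raw counts; a sketch using the soft-drink counts from this section's frequency table:

```r
# #Relative, percent, and cumulative frequency from raw counts
freq <- c(`Coca-Cola` = 19, Pepsi = 13, `Diet Coke` = 8, `Dr. Pepper` = 5, Sprite = 5)
rel <- freq / sum(freq)  # relative frequency (out of 1)
data.frame(Frequency = freq, Relative = rel,
           Percent = 100 * rel, Cumulative = cumsum(freq))
```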
ggplot() does not allow easy setup of a dual axis.
| softdrink | Frequency | cSUM | PROP | PCT | cPCT |
|---|---|---|---|---|---|
| Coca-Cola | 19 | 19 | 38 | 38% | 38% |
| Pepsi | 13 | 32 | 26 | 26% | 64% |
| Diet Coke | 8 | 40 | 16 | 16% | 80% |
| Dr. Pepper | 5 | 45 | 10 | 10% | 90% |
| Sprite | 5 | 50 | 10 | 10% | 100% |
Figure 6.1 Bar Chart and Pie Chart of Frequency
# #Frequency Distribution
aa <- tibble(softdrink = c("Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"),
Frequency = c(19, 8, 5, 13, 5))
#
# #Sort, Cumulative Sum, Percentage, and Cumulative Percentage
bb <- aa %>%
arrange(desc(Frequency)) %>%
mutate(cSUM = cumsum(Frequency)) %>%
mutate(PROP = 100 * Frequency / sum(Frequency)) %>%
mutate(PCT = paste0(PROP, "%")) %>%
mutate(cPCT = paste0(100 * cumsum(Frequency) / sum(Frequency), "%"))
# #Sorted Bar Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P01 <- bb %>% mutate(across(softdrink, factor, levels = rev(unique(softdrink)))) %>% {
ggplot(data = ., aes(x = softdrink, y = Frequency)) +
geom_bar(stat = 'identity', aes(fill = softdrink)) +
scale_y_continuous(sec.axis = sec_axis(~ (. / sum(bb$Frequency))*100, name = "Percentages",
labels = function(b) { paste0(round(b, 0), "%")})) +
geom_text(aes(label = paste0(Frequency, "\n(", PCT, ")")), vjust = 2,
colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
k_gglayer_bar +
labs(x = "Soft Drinks", y = "Frequency", subtitle = NULL,
caption = "C02P01", title = "Bar Chart of Categorical Data")
}
# #Pie Chart of Frequencies (Needs x-axis as Factor for proper sorting)
C02P02 <- bb %>% mutate(across(softdrink, factor, levels = unique(softdrink))) %>% {
ggplot(data = ., aes(x = '', y = Frequency, fill = rev(softdrink))) +
geom_bar(stat = 'identity', width = 1, color = "white") +
coord_polar(theta = "y", start = 0) +
geom_text(aes(label = paste0(softdrink, "\n", Frequency, " (", PCT, ")")),
position = position_stack(vjust = 0.5),
colour = c(rep("black", 2), rep("white", nrow(bb)-2))) +
k_gglayer_pie +
labs(caption = "C02P02", title = "Pie Chart of Categorical Data")
}
f_theme_gg <- function(base_size = 14) {
# #Create a Default Theme
theme_bw(base_size = base_size) %+replace%
theme(
# #The whole figure
plot.title = element_text(size = rel(1), face = "bold",
margin = margin(0,0,5,0), hjust = 0),
# #Area where the graph is located
panel.grid.minor = element_blank(),
panel.border = element_blank(),
# #The axes
axis.title = element_text(size = rel(0.85), face = "bold"),
axis.text = element_text(size = rel(0.70), face = "bold"),
# arrow = arrow(length = unit(0.3, "lines"), type = "closed"),
axis.line = element_line(color = "black"),
# The legend
legend.title = element_text(size = rel(0.85), face = "bold"),
legend.text = element_text(size = rel(0.70), face = "bold"),
legend.key = element_rect(fill = "transparent", colour = NA),
legend.key.size = unit(1.5, "lines"),
legend.background = element_rect(fill = "transparent", colour = NA),
# Labels in the case of facetting
strip.background = element_rect(fill = "#17252D", color = "#17252D"),
strip.text = element_text(size = rel(0.85), face = "bold", color = "white", margin = margin(5,0,5,0))
)
}
# #Change default ggplot2 theme
theme_set(f_theme_gg())
#
# #List of Specific sets. Note '+' is replaced by ','
k_gglayer_bar <- list(
scale_fill_viridis_d(),
theme(panel.grid.major.x = element_blank(), axis.line = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=1),
legend.position = 'none', axis.title.y.right = element_blank())
)
#
# #Pie
k_gglayer_pie <- list(
scale_fill_viridis_d(),
#theme_void(),
theme(#panel.background = element_rect(fill = "white", colour = "white"),
#plot.background = element_rect(fill = "white",colour = "white"),
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
#panel.border = element_rect(colour = "black", fill=NA, size=1),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = 'none')
)
#
# #Histogram
k_gglayer_hist <- list(
scale_fill_viridis_c(direction = -1, alpha = 0.9),
theme(panel.grid.major.x = element_blank(), axis.line.y = element_blank(),
panel.border = element_blank(), axis.ticks.y = element_blank(),
legend.position = 'none')
)
#
# #Scatter Plot Trendline
k_gglayer_scatter <- list(
scale_fill_viridis_d(alpha = 0.9),
theme(panel.grid.minor = element_blank(),
panel.border = element_blank())
)
#
# #BoxPlot
k_gglayer_box <- list(
scale_fill_viridis_d(alpha = 0.9),
theme(panel.grid.major = element_line(colour = "#d3d3d3"),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(), panel.grid.major.x = element_blank(),
#plot.title = element_text(size = 14, family = "Tahoma", face = "bold"),
#text=element_text(family = "Tahoma"),
#axis.title = element_text(face="bold"),
#axis.text.x = element_text(colour="black", size = 11),
#axis.text.y = element_text(colour="black", size = 9),
axis.line = element_line(size=0.5, colour = "black"))
)

Solution: Use geom_bar(stat = "identity")
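A minimal sketch of that solution, using an invented data frame `dd` of pre-tabulated counts (not from this book's data):

```r
library(ggplot2)
# #Hypothetical pre-counted data
dd <- data.frame(grade = c("A", "B", "C"), n = c(5, 13, 12))
# #stat = "identity" plots the supplied y values instead of counting rows
ggplot(dd, aes(x = grade, y = n)) + geom_bar(stat = "identity")
```

Note that geom_col() is the equivalent shorthand for geom_bar(stat = "identity").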
A histogram is used for continuous data, where the bins represent ranges of the data, while a bar chart is a plot of categorical variables.
The three steps necessary to define the classes for a frequency distribution with quantitative data are: (1) determine the number of non-overlapping classes, (2) determine the width of each class, and (3) determine the class limits.
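A minimal sketch of these three steps in base R, on invented data; nclass.Sturges() is just one common rule for choosing the number of classes:

```r
# #Hypothetical data
x <- c(12, 15, 21, 9, 30, 27, 18, 24, 11, 16)
# #1. Determine the number of classes
k <- nclass.Sturges(x)
# #2. Determine the (rounded-up) class width
w <- ceiling(diff(range(x)) / k)
# #3. Determine the class limits, then tabulate
brks <- seq(floor(min(x)), max(x) + w, by = w)
table(cut(x, breaks = brks, include.lowest = TRUE))
```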
set.seed(3)
# #Get Normal Data
bb <- tibble(aa = rnorm(n = 10000)) # #Histogram
# # '..count..' or '..x..'
C02P03 <- bb %>% {
ggplot(data = ., aes(x = aa, fill = ..count..)) +
geom_histogram(bins = 50, position = "identity") +
k_gglayer_hist +
labs(x = "Normal Data", y = "Count", subtitle = paste0("n = ", format(nrow(.), big.mark=",")),
caption = "C02P03", title = "Histogram")
}

Figure 6.2 geom_histogram(): Histogram
# #Random Data
aa <- c(26, 35, 22, 47, 37, 5, 50, 49, 42, 2, 8, 7, 4, 47, 44, 35, 17, 49, 1, 48,
1, 27, 13, 26, 18, 44, 31, 4, 23, 47, 38, 28, 28, 5, 35, 39, 29, 13, 17,
38, 1, 8, 3, 30, 18, 37, 29, 39, 7, 28)

bb <- tibble(aa)
# #Dot Chart of Frequencies
C02P04 <- bb %>% {
ggplot(., aes(x = aa)) +
geom_dotplot(binwidth = 5, method = "histodot") +
theme(axis.line.y = element_blank(), panel.grid = element_blank(), axis.text.y = element_blank(),
axis.ticks.y = element_blank(), axis.title.y = element_blank()) +
labs(x = "Bins", subtitle = "Caution: Avoid! Y-Axis is deceptive.",
caption = "C02P04", title = "Dot Plot")
}

Figure 6.3 geom_dotplot(): Frequency Dot Chart
# #Judges: Because we are evaluating 'Judges', they are the 'elements' and thus are the 'rows'
xxJudges <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
# #Unaggregated crosstab for both Judges in different types of Courts
xxKen <- tibble(Ken = c("Common", "Municipal "),
Upheld = c(90, 20), Reversed = c(10, 5))
xxAbel <- tibble(Abel = c("Common", "Municipal "),
Upheld = c(29, 100), Reversed = c(3, 18))

# #Judges
aa <- tibble(Judge_Verdict = c('Abel', 'Ken'), Upheld = c(129, 110), Reversed = c(21, 15))
bb <- tibble(Verdict_Judge = c('Upheld', 'Reversed'), Abel = c(129, 21), Ken = c(110, 15))
aa
## # A tibble: 2 x 3
## Judge_Verdict Upheld Reversed
## <chr> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
# #Transpose, Assuming First Column Header has "Row_Col" Type Format
ii <- aa %>%
`attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>%
`attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>%
pivot_longer(c(-1),
names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost),
values_to = "Values") %>%
pivot_wider(names_from = 1, values_from = Values) %>%
`attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL)
stopifnot(identical(bb, ii))
ii
## # A tibble: 2 x 3
## Verdict_Judge Abel Ken
## <chr> <dbl> <dbl>
## 1 Upheld 129 110
## 2 Reversed 21 15
# #Testing for Reverse
ii <- bb %>%
`attr<-`("ColsLost", unlist(strsplit(names(.)[1], "_"))[1]) %>%
`attr<-`("RowsKept", unlist(strsplit(names(.)[1], "_"))[2]) %>%
pivot_longer(c(-1),
names_to = paste0(attributes(.)$RowsKept, "_", attributes(.)$ColsLost),
values_to = "Values") %>%
pivot_wider(names_from = 1, values_from = Values) %>%
`attr<-`("ColsLost", NULL) %>% `attr<-`("RowsKept", NULL)
stopifnot(identical(aa, ii))

bb <- "Judge_Verdict"
# #Split String by strsplit(), output is list
(ii <- unlist(strsplit(bb, "_")))
## [1] "Judge" "Verdict"
#
# #Split on Dot
bb <- "Judge.Verdict"
# #Using character classes
ii <- unlist(strsplit(bb, "[.]"))
# #By escaping special characters
jj <- unlist(strsplit(bb, "\\."))
# #Using Options
kk <- unlist(strsplit(bb, ".", fixed = TRUE))
stopifnot(all(identical(ii, jj), identical(ii, kk)))

jj <- ii <- bb <- aa
# #attr() adds or removes an attribute
attr(bb, "NewOne") <- "abc"
# #Using Backticks
ii <- `attr<-`(ii, "NewOne", "abc")
# #Using Pipe
jj <- jj %>% `attr<-`("NewOne", "abc")
#
stopifnot(all(identical(bb, ii), identical(bb, jj)))
#
# #List Attributes
names(attributes(bb))
## [1] "class" "row.names" "names" "NewOne"
#
# #Specific Attribute Value
attributes(bb)$NewOne
## [1] "abc"
#
# #Remove Attributes
attr(bb, "NewOne") <- NULL
ii <- `attr<-`(ii, "NewOne", NULL)
jj <- jj %>% `attr<-`("NewOne", NULL)
stopifnot(all(identical(bb, ii), identical(bb, jj)))

# #(Deprecated) Issues:
# #(1) bind_rows() needs two dataframes. Thus, the first can be supplied by the Pipe, but...
# #The second dataframe cannot be replaced with dot (.); it has to have a name
# #(2) Pipe usage inside a function call was working but was a concern
# #(3) It introduced NA, so replace() was needed as another step
ii <- aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
mutate(across(1, ~ replace(., . %in% NA, "Total")))
#
# #(Deprecated) Works but needs ALL Column Names individually
jj <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.[, 2]), Reversed = sum(.[, 3]))
kk <- aa %>% add_row(Judge_Verdict = "Total", Upheld = sum(.$Upheld), Reversed = sum(.$Reversed))
#
# #(Deprecated) Removed the Multiple call to sum(). However, it needs First Column Header Name
ll <- aa %>% add_row(Judge_Verdict = "Total", summarise(., across(where(is.numeric), sum)))
# #(Deprecated) Replaced Column Header Name using "Tilde"
mm <- aa %>% add_row(summarise(., across(where(is.character), ~"Total")),
summarise(., across(where(is.numeric), sum)))
stopifnot(all(identical(ii, jj), identical(ii, kk), identical(ii, ll), identical(ii, mm)))
#
# #(Working): Minimised
aa %>% add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.numeric), sum)))
## # A tibble: 3 x 3
## Judge_Verdict Upheld Reversed
## <chr> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
## 3 Total 239 36

# # USE '%in%' for NA, otherwise '==' works
aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
mutate(across(1, ~ replace(., . %in% NA, "Total")))
## # A tibble: 3 x 3
## Judge_Verdict Upheld Reversed
## <chr> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
## 3 Total 239 36
#
# #Replace NA in a Factor
aa %>% bind_rows(aa %>% summarise(across(where(is.numeric), sum))) %>%
mutate(Judge_Verdict = factor(Judge_Verdict)) %>%
mutate(across(1, fct_explicit_na, na_level = "Total"))
## # A tibble: 3 x 3
## Judge_Verdict Upheld Reversed
## <fct> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15
## 3 Total 239 36

# #Convert to Factor
aa %>% mutate(Judge_Verdict = factor(Judge_Verdict))
## # A tibble: 2 x 3
## Judge_Verdict Upheld Reversed
## <fct> <dbl> <dbl>
## 1 Abel 129 21
## 2 Ken 110 15

# #Paste but do not execute
aa <- read_delim(clipboard())
# #Copy Excel Data, then execute the above command
#
# #Print its structure
dput(aa)
# #Copy the relevant values, headers in tibble()
bb <- tibble( )
# #The above command will be the setup to generate this tibble anywhere

ex27 <- tibble(Observation = 1:30,
x = c("A", "B", "B", "C", "B", "C", "B", "C", "A", "B", "A", "B", "C", "C", "C",
"B", "C", "B", "C", "B", "C", "B", "C", "A", "B", "C", "C", "A", "B", "B"),
y = c(1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2,
2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2))

bb <- ex27
str(bb)
## tibble [30 x 3] (S3: tbl_df/tbl/data.frame)
## $ Observation: int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
## $ x : chr [1:30] "A" "B" "B" "C" ...
## $ y : num [1:30] 1 1 1 2 1 2 1 2 1 1 ...
# #Create CrossTab
bb <- bb %>%
count(x, y) %>%
pivot_wider(names_from = y, values_from = n, values_fill = 0)

bb
## # A tibble: 3 x 3
## x `1` `2`
## <chr> <int> <int>
## 1 A 5 0
## 2 B 11 2
## 3 C 2 10
# #Rowwise Percentage in Separate New Columns
bb %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1), .names = "{.col}_Row" ))
## # A tibble: 3 x 7
## x `1` `2` SUM `1_Row` `2_Row` SUM_Row
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 A 5 0 5 100 0 100
## 2 B 11 2 13 84.6 15.4 100
## 3 C 2 10 12 16.7 83.3 100
#
# #Rowwise Percentage in Same Columns
bb %>%
mutate(SUM = rowSums(across(where(is.numeric)))) %>%
mutate(across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 3 x 4
## x `1` `2` SUM
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 0 100
## 2 B 84.6 15.4 100
## 3 C 16.7 83.3 100
#
# #Equivalent
bb %>%
mutate(SUM = rowSums(across(where(is.numeric))),
across(where(is.numeric), ~ round(. * 100 /SUM, 1)))
## # A tibble: 3 x 4
## x `1` `2` SUM
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 0 100
## 2 B 84.6 15.4 100
## 3 C 16.7 83.3 100
#
# #Columnwise Percentage in Separate New Columns
bb %>%
mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1), .names = "{.col}_Col" ))
## # A tibble: 3 x 5
## x `1` `2` `1_Col` `2_Col`
## <chr> <int> <int> <dbl> <dbl>
## 1 A 5 0 27.8 0
## 2 B 11 2 61.1 16.7
## 3 C 2 10 11.1 83.3
# #Columnwise Percentage in Same Columns
bb %>%
mutate(across(where(is.numeric), ~ round(. * 100 /sum(.), 1)))
## # A tibble: 3 x 3
## x `1` `2`
## <chr> <dbl> <dbl>
## 1 A 27.8 0
## 2 B 61.1 16.7
## 3 C 11.1 83.3

ex28 <- tibble(Observation = 1:20,
x = c(28, 17, 52, 79, 37, 71, 37, 27, 64, 53, 13, 84, 59, 17, 70, 47, 35, 62, 30, 43),
y = c(72, 99, 58, 34, 60, 22, 77, 85, 45, 47, 98, 21, 32, 81, 34, 64, 68, 67, 39, 28))

bb <- ex28
# #Rounding to the lowest 10s before min and to the highest 10s after max
nn <- 10L
n_x <- seq(floor(min(bb$x) / nn) * nn, ceiling(max(bb$x) / nn) * nn, by = 20)
n_y <- seq(floor(min(bb$y) / nn) * nn, ceiling(max(bb$y) / nn) * nn, by = 20)
#
# #Labels in the format of [10-29]
lab_x <- paste0(n_x, "-", n_x +20 -1) %>% head(-1)
lab_y <- paste0(n_y, "-", n_y +20 -1) %>% head(-1)
# #Wider Table without Totals
ii <- bb %>%
mutate(x_bins = cut(x, breaks = n_x, right = FALSE, labels = lab_x),
y_bins = cut(y, breaks = n_y, right = FALSE, labels = lab_y)) %>%
count(x_bins, y_bins) %>%
pivot_wider(names_from = y_bins, values_from = n, values_fill = 0, names_sort = TRUE)
print(ii)
## # A tibble: 4 x 5
## x_bins `20-39` `40-59` `60-79` `80-99`
## <fct> <int> <int> <int> <int>
## 1 10-29 0 0 1 4
## 2 30-49 2 0 4 0
## 3 50-69 1 3 1 0
## 4 70-89 4 0 0 0
# #Cross Tab with Total Column and Total Row
jj <- ii %>%
bind_rows(ii %>% summarise(across(where(is.numeric), sum))) %>%
mutate(across(1, fct_explicit_na, na_level = "Total")) %>%
mutate(SUM = rowSums(across(where(is.numeric))))
print(jj)
## # A tibble: 5 x 6
## x_bins `20-39` `40-59` `60-79` `80-99` SUM
## <fct> <int> <int> <int> <int> <dbl>
## 1 10-29 0 0 1 4 5
## 2 30-49 2 0 4 0 6
## 3 50-69 1 3 1 0 5
## 4 70-89 4 0 0 0 4
## 5 Total 7 3 6 4 20

# #Group Continuous Data to Categorical Bins by base::cut()
bb <- ex28
#
# #NOTE cut() increases the range slightly but ggplot functions don't
bb %>% mutate(x_bins = cut(x, breaks = 8)) %>%
pull(x_bins) %>% levels()
## [1] "(12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #By default, it excludes the lower range, but it can be included by option
bb %>% mutate(x_bins = cut(x, breaks = 8, include.lowest = TRUE)) %>%
pull(x_bins) %>% levels()
## [1] "[12.9,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84.1]"
#
# #ggplot::cut_interval() makes n groups with equal range. There is a cut_number() also
bb %>% mutate(x_bins = cut_interval(x, n = 8)) %>%
pull(x_bins) %>% levels()
## [1] "[13,21.9]" "(21.9,30.8]" "(30.8,39.6]" "(39.6,48.5]" "(48.5,57.4]" "(57.4,66.2]"
## [7] "(66.2,75.1]" "(75.1,84]"
#
# #Specific Bins
bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10))) %>%
pull(x_bins) %>% levels()
## [1] "(10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
ii <- bb %>% mutate(x_bins = cut(x, breaks = seq(10, 90, by = 10), include.lowest = TRUE)) %>%
pull(x_bins) %>% levels()
print(ii)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
#
# #ggplot::cut_width() makes groups of width
bb %>% mutate(x_bins = cut_width(x, width = 10)) %>%
pull(x_bins) %>% levels()
## [1] "[5,15]" "(15,25]" "(25,35]" "(35,45]" "(45,55]" "(55,65]" "(65,75]" "(75,85]"
#
# #Match cut_width() and cut()
jj <- bb %>% mutate(x_bins = cut_width(x, width = 10, boundary = 0)) %>%
pull(x_bins) %>% levels()
print(jj)
## [1] "[10,20]" "(20,30]" "(30,40]" "(40,50]" "(50,60]" "(60,70]" "(70,80]" "(80,90]"
stopifnot(identical(ii, jj))
#
# #Labelling
n_breaks <- seq(10, 90, by = 10)
n_labs <- paste0("*", n_breaks, "-", n_breaks + 10) %>% head(-1)
bb %>% mutate(x_bins = cut(x, breaks = n_breaks, include.lowest = TRUE, labels = n_labs)) %>%
pull(x_bins) %>% levels()
## [1] "*10-20" "*20-30" "*30-40" "*40-50" "*50-60" "*60-70" "*70-80" "*80-90"

xxCommercials <- tibble(Week = 1:10,
Commercials = c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2),
Sales = c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46))
f_setRDS(xxCommercials)

Figure 6.4 geom_point(), geom_smooth(), & stat_poly_eq()
bb <- xxCommercials
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot and Trendline, along with its equation and R2 value
C02P05 <- bb %>% {
ggplot(data = ., aes(x = Commercials, y = Sales)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_point() +
labs(x = "Commercials", y = "Sales ($100s)",
subtitle = paste0("Trendline equation and R", '\u00b2', " value"),
caption = "C02P05", title = "Scatter Plot")
}

Individual numbers can be represented by symbols, called numerals; for example, “5” is a numeral that represents the ‘number five.’
As only a relatively small number of symbols can be memorized, basic numerals are commonly organized in a numeral system, which is an organized way to represent any number. The most common numeral system is the Hindu–Arabic numeral system, which allows for the representation of any number using a combination of ten fundamental numeric symbols, called digits.
Counting is the process of determining the number of elements of a finite set of objects, i.e., determining the size of a set. Enumeration refers to uniquely identifying the elements of a set by assigning a number to each element.
Measurement is the quantification of attributes of an object or event, which can be used to compare with other objects or events.
Formally, \(\mathbb{N} \to \mathbb{Z} \to \mathbb{Q} \to \mathbb{R} \to \mathbb{C}\)
Practically, \(\mathbb{N} \subset \mathbb{Z} \subset \mathbb{Q} \subset \mathbb{R} \subset \mathbb{C}\)
The natural numbers \(\mathbb{N}\) are those numbers used for counting and ordering. The ISO standard begins the natural numbers with 0, corresponding to the non-negative integers \(\mathbb{N} = \{0, 1, 2, 3, \ldots \}\), whereas other conventions start with 1, corresponding to the positive integers \(\mathbb{N^*} = \{1, 2, 3, \ldots \}\)
The set of integers \(\mathbb{Z}\) consists of zero (\({0}\)), the positive natural numbers \(\{1, 2, 3, \ldots \}\), and their additive inverses (the negative integers). Thus \(\mathbb{Z} = \{\ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \}\). An integer is colloquially defined as a number that can be written without a fractional component.
Rational numbers \(\mathbb{Q}\) are those which can be expressed as the quotient or fraction p/q of two integers, a numerator p and a non-zero denominator q. Thus, Rational Numbers \(\mathbb{Q} = \{p/q : p, q \in \mathbb{Z}, q \neq 0\}\)
A real number is a value of a continuous quantity that can represent a distance along a line. The real numbers include all the rational numbers \(\mathbb{Q}\), and all the irrational numbers. Thus, Real Numbers \(\mathbb{R} = \mathbb{Q} \cup \{\sqrt{2}, \sqrt{3}, \ldots\} \cup \{ \pi, e, \phi, \ldots \}\)
The complex numbers \(\mathbb{C}\) contain numbers which are expressed in the form \(a + ib\), where \({a}\) and \({b}\) are real numbers. Each has two components, a real part and an imaginary part, built on a specific element denoted by \({i}\) (the imaginary unit) which satisfies the equation \(i^2 = -1\).
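These sets map loosely onto R's atomic types; a small sketch:

```r
# #Z (and N): fixed-size integers, written with the L suffix
is.integer(5L)
# #Q and R share one storage type: double-precision floating point
is.double(1/3)
# #C: complex numbers written a + bi
is.complex(2 + 3i)
# #The imaginary unit satisfies i^2 = -1
sqrt(-1 + 0i)
## [1] 0+1i
```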
The number Pi \(\pi = 3.14159\ldots\) is defined as the ratio of circumference of a circle to its diameter.
\[\pi = \int _{-1}^{1} \frac{dx}{\sqrt{1- x^2}} \tag{7.1}\]
\[e^{i\varphi}=\cos \varphi +i\sin \varphi \tag{7.2}\]
\[e^{i\pi} + 1 =0 \tag{7.3}\]
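Equations (7.1) and (7.3) can be sanity-checked numerically; a sketch using base R's integrate():

```r
# #Equation (7.1): the integral evaluates to PI (adaptive quadrature handles the endpoints)
integrate(function(x) 1 / sqrt(1 - x^2), lower = -1, upper = 1)$value
# #Equation (7.3): Euler's identity, up to floating-point error in the imaginary part
exp(1i * pi) + 1
```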
# #Read OEIS File for 20000 PI digits including integral (3) and fractional (14159...)
# #md5sum = "daf0b33a67fd842a905bb577957a9c7f"
tbl <- read_delim(file = paste0(.z$XL, "PI-OIS-b000796.txt"),
delim = " ", col_names = c("POS", "VAL"), col_types = list(POS = "i", VAL = "i"))
attr(tbl, "spec") <- NULL
attr(tbl, "problems") <- NULL
xxPI <- tbl
f_setRDS(xxPI)

Euler's number \(e = 2.71828\ldots\) is the base of the natural logarithm.
\[e = \lim_{n \to \infty} \left(1 + \frac{1}{n} \right)^{n} = \sum \limits_{n=0}^{\infty} \frac{1}{n!} \tag{7.4}\]
Two quantities are in the golden ratio \(\varphi = 1.618\ldots\) if their ratio is the same as the ratio of their sum to the larger of the two quantities.
\[\varphi^2 - \varphi -1 =0 \\ \varphi = \frac{1+\sqrt{5}}{2} \tag{7.5}\]
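Both constants are easy to sanity-check in a couple of lines (a sketch):

```r
# #Equation (7.4): partial sum of the series, already accurate to several digits
sum(1 / factorial(0:15))
# #Equation (7.5): the positive root of x^2 - x - 1 = 0
phi <- (1 + sqrt(5)) / 2
stopifnot(all.equal(phi^2 - phi - 1, 0))
```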
Mersenne primes:
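The limits used below (131071, 524287, 2147483647) are all Mersenne primes, i.e. primes of the form \(2^p - 1\); a quick sketch:

```r
# #Mersenne numbers M_p = 2^p - 1; for these exponents p, M_p is itself prime
p <- c(2, 3, 5, 7, 13, 17, 19, 31)
2^p - 1
# #The last three values are 131071 (2^17 - 1), 524287 (2^19 - 1), 2147483647 (2^31 - 1)
```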
# #Create empty Vector with NA
aa <- rep(NA_integer_, 10)
print(aa)
## [1] NA NA NA NA NA NA NA NA NA NA

f_isPrime <- function(x) {
# #Check if the number is Prime
if(!is.integer(x)) {
cat("Error! Integer required. \n")
stop()
} else if(x <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(x > 2147483647L) {
cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
stop()
}
# #However, this checks the number against ALL Smaller values including non-primes
if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
# # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
return(TRUE)
} else {
## (any(x %% 2L:ceiling(sqrt(x)) == 0))
## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
return(FALSE)
}
}
# #Vectorise Version
f_isPrimeV <- Vectorize(f_isPrime)
# #Compiled Version
f_isPrimeC <- cmpfun(f_isPrime)

# #There are 4 Primes in First 10, 25 in 100, 168 in 1000, 1229 in 10000.
# # Using Vectorise Version, get all the Primes
aa <- 1:10
bb <- aa[f_isPrimeV(aa)]
ii <- f_getPrimeUpto(10)
stopifnot(identical(bb, ii))
# #
xxPrime10 <- c(2, 3, 5, 7) |> as.integer()
# #
xxPrime100 <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
53, 59, 61, 67, 71, 73, 79, 83, 89, 97) |> as.integer()
#
# #Generate List of ALL Primes till 524287 (i.e. Total 43,390 Primes)
xxPrimes <- f_getPrimeUpto(524287L)
# #Save as RDS
f_setRDS(xxPrimes)

# #NOTE: Assigning 2147483647L causes the Chunk to throw Warnings even with 'eval=FALSE'.
if(FALSE){
# #Assignment of 2305843009213693951L is NOT possible without Warning
# #Even within non-executing Block or with 'eval=FALSE' or suppressWarnings() or tryCatch()
# #It cannot be stored as integer, thus it is automatically converted to double
#bb <- 2305843009213693951L
# #Warning: non-integer value 2305843009213693951L qualified with L; using numeric value
# #NOTE that the value changed. It is explicitly NOT a prime anymore.
#print(bb, digits = 20)
# #[1] 2305843009213693952
#
# #Assignment of 2147483647L is possible and direct printing in console works BUT
# #Its printing will also throw Warnings that are difficult to handle
# #Avoid Printing. Even within non-executing Block, it is affecting R Bookdown.
aa <- 2147483647L
#print(aa)
}

f_getPrimeUpto <- function(x){
# #Get a Vector of Primes upto the given Number (Max. 524287)
if(x < 2) {
print("NOT ALLOWED!")
return(NULL)
} else if(x > 524287){
print("Sadly, beyond this number it is difficult to generate the List of Primes!")
return(NULL)
}
y <- 2:x
i <- 1
while (y[i] <= sqrt(x)) {
y <- y[y %% y[i] != 0 | y == y[i]]
i <- i+1
}
return(y)
}

# #Compare any number of functions
result <- microbenchmark(
sum(1:100)/length(1:100),
mean(1:100),
#times = 1000,
check = 'identical'
)
# #Print Table
print(result)
## Unit: microseconds
## expr min lq mean median uq max neval cld
## sum(1:100)/length(1:100) 1.2 1.301 1.54795 1.5005 1.6005 7.501 100 a
## mean(1:100) 5.9 6.001 6.56989 6.1010 6.2010 28.001 100 b
#
# #Boxplot of Benchmarking Result
#autoplot(result)
# #Above testcase showed a surprising result of sum()/length() being much faster than mean()“ForLater” - Include rowsum(), rowSums(), colSums(), rowMeans(), colMeans() in this also.
# #Conclusion: use mean() because precision is difficult to achieve compared to speed
#
# #sum()/length() is faster than mean()
# #However, mean() does double pass, so it would be more accurate
# #mean.default() and var() compute means with an additional pass and so are more accurate
# #e.g. the variance of a constant vector is (almost) always zero
# #and the mean of such a vector will be equal to the constant value to machine precision.
aa <- 1:100
#
microbenchmark(
sum(aa)/length(aa),
mean(aa),
mean.default(aa),
.Internal(mean(aa)),
#times = 1000,
check = 'identical'
)
## Unit: nanoseconds
## expr min lq mean median uq max neval cld
## sum(aa)/length(aa) 900 1000 1275 1100 1300 7200 100 a
## mean(aa) 5300 5500 7096 5700 6700 23800 100 c
## mean.default(aa) 2900 3100 3987 3200 3700 18400 100 b
## .Internal(mean(aa)) 1100 1200 1475 1200 1300 9700 100 a
# #rnorm() generates random deviates of given length
set.seed(3)
aa <- rnorm(1e7)
str(aa)
## num [1:10000000] -0.962 -0.293 0.259 -1.152 0.196 ...
#
# #NOTE manual calculation and mean() is NOT matching
identical(sum(aa)/length(aa), mean(aa))
## [1] FALSE
#
# #There is a slight difference
sum(aa)/length(aa) - mean(aa)
## [1] 2.355429e-17

if(FALSE) {
# #Remove all objects matching a pattern
rm(list = ls(pattern = "f_"))
}

# #Check the Current Options Value
getOption("expressions")
## [1] 5000
if(FALSE) {
# #Change Value
# #NOTE it didn't help when recursive function failed
# #Error: node stack overflow
# #Error during wrapup: node stack overflow
# #Error: no more error handlers available ...
options(expressions=10000)
}

# #To Vectorise a Function
f_isPrimeV <- Vectorize(f_isPrime)

# #To Pre-Compile a Function for faster performance
f_isPrimeC <- cmpfun(f_isPrime)

# #To Profile Function Calls for improvements
Rprof("file.out")
f_isPrime(2147483647L)
#f_getPrimesUpto(131071L)
Rprof(NULL)
summaryRprof("file.out")

# #Functions to check for PRIME - All of them have various problems
# #"-3L -2L -1L 0L 1L 8L" FALSE "2L 3L ... 524287L 2147483647L" TRUE
isPrime_a <- function(x) {
# #Fails for "2147483647L" Error: cannot allocate vector of size 8.0 Gb
if (x == 2L) {
return(TRUE)
} else if (any(x %% 2:(x-1) == 0)) {
return(FALSE)
} else return(TRUE)
}
isPrime_b <- function(x){
# #Comparison of Division and Integer Division by 1,2,...,x
# #Fails for "2147483647L" Error: cannot allocate vector of size 16.0 Gb
# #Fails for "-ve and zero" Error: missing value where TRUE/FALSE needed
# vapply(x, function(y) sum(y / 1:y == y %/% 1:y), integer(1L)) == 2L
if(sum(x / 1:x == x %/% 1:x) == 2) {
return(TRUE)
} else return(FALSE)
}
isPrime_c <- function(x) {
# #RegEx, Slowest: It converts -ve values and coerces non-integers, which may result in bugs
x <- abs(as.integer(x))
if(x > 8191L) {
print("Do not run this with large values. RegEx is really slow.")
stop()
}
!grepl('^1?$|^(11+?)\\1+$', strrep('1', x))
}
isPrime_d <- function(x) {
# #Fails for "1" & returns TRUE
# #Fails for "-ve and zero" Error: NA/NaN argument
if(x == 2L || all(x %% 2L:max(2, floor(sqrt(x))) != 0)) {
return(TRUE)
} else return(FALSE)
}
isPrime_e <- function(x) {
# #Fails for "-ve and zero" Error: NA/NaN argument
# #This is the most robust which can be improved by conditional check for positive integers
# #However, this checks the number against ALL Smaller values including non-primes
if(x == 2L || all(x %% 2L:ceiling(sqrt(x)) != 0)) {
# # "seq.int(3, ceiling(sqrt(x)), 2)" is slower
return(TRUE)
} else {
## (any(x %% 2L:ceiling(sqrt(x)) == 0))
## (any(x %% seq.int(3, ceiling(sqrt(x)), 2) == 0))
## NOTE Further, if sequence starts from 3, add 2 also as a Prime Number
return(FALSE)
}
}

# #131071 (12,251st), 524287 (43,390th), 2147483647 (105,097,565th)
aa <- 1:131071
# #Following works but only till 524287L, Memory Overflow ERROR for 2147483647L
bb <- aa[f_isPrimeV(aa)]
getPrimeUpto_a <- function(x){
# #Extremely Slow, Can't go beyond 8191L in benchmark testing
if(x < 2) return("ERROR")
y <- 2:x
primes <- rep(2L, x)
j <- 1L
for (i in y) {
if (!any(i %% primes == 0)) {
j <- j + 1L
primes[j] <- i
#cat(paste0("i=", i, ", j=", j, ", Primes= ",paste0(head(primes, j), collapse = ", ")))
}
#cat("\n")
}
result <- head(primes, j)
#str(result)
#cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
#cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
getPrimeUpto_b <- function(x){
# #https://stackoverflow.com/questions/3789968/
# #This is much faster even than "aa[f_isPrimeV(aa)]"
if(x < 2) return("ERROR")
y <- 2:x
i <- 1
while (y[i] <= sqrt(x)) {
y <- y[y %% y[i] != 0 | y == y[i]]
i <- i+1
}
result <- y
#str(result)
#cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
#cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
getPrimeUpto_c <- function(x) {
# #Problems and Slow
# #Returns a Vector of Primes till the Number i.e. f_getPrimesUpto(7) = (2, 3, 5, 7)
# #NOTE: f_getPrimesUpto(1) and f_getPrimesUpto(2) both return "2"
if(!is.integer(x)) {
cat("Error! Integer required. \n")
stop()
} else if(!identical(1L, length(x))) {
cat("Error! Unit length vector required. \n")
stop()
} else if(x <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(x > 2147483647) {
cat(paste0("Doubles are stored as approximation. Prime will not be calculated for value higher than '2147483647' \n"))
stop()
}
# #Can't create vector of length 2147483647L and also not needed that many
# #ceiling(sqrt(7L)) returns 3; however, we need length 4 (2, 3, 5, 7)
# #So, added extra 10
#primes <- rep(NA_integer_, 10L + sqrt(2L))
primes <- rep(2L, 10L + sqrt(2L))
j <- 1L
primes[j] <- 2L
#
i <- 2L
while(i <= x) {
# #na.omit() was the slowest step, so changed all NA to 2L in the primes
#k <- na.omit(primes[primes <= ceiling(sqrt(i))])
k <- primes[primes <= ceiling(sqrt(i))]
if(all(as.logical(i %% k))) {
j <- j + 1
primes[j] <- i
}
# #Increment with INTEGER Addition
i = i + 1L
}
result <- primes[complete.cases(primes)]
str(result)
cat(paste0("Head: ", paste0(head(result), collapse = ", "), "\n"))
cat(paste0("Tail: ", paste0(tail(result), collapse = ", "), "\n"))
return(result)
}
getPrimeUpto_d <- function(n = 10L, i = 2L, primes = c(2L), bypass = TRUE){
# #Using Recursion is NOT a good solution
# #Function to return N Primes upto 1000 Primes (7919) or Max Value reaching 10000.
if(i > 10000){
cat("Reached 10000 \n")
return(primes)
}
if(bypass) {
maxN <- 1000L
if(!is.integer(n)) {
cat("Error! Integer required. \n")
stop()
} else if(!identical(1L, length(n))) {
cat("Error! Unit length vector required. \n")
stop()
} else if(n <= 0L) {
cat("Error! Positive Integer required. \n")
stop()
} else if(n > maxN) {
cat(paste0("Error! This will calculate only upto ", maxN, " Prime Numbers. \n"))
stop()
}
}
if(length(primes) < n) {
if(all(as.logical(i %% primes[primes <= ceiling(sqrt(i))]))) {
# #Coercing 0 to FALSE, Non-zero Values to TRUE
# # "i %% 2L:ceiling(sqrt(i))" checks i against all integers till sqrt(i)
# # "primes[primes <= ceiling(sqrt(i))]" checks i against only the primes till sqrt(i)
# #However, the above needs hardcoded 2L as prime so the vector is never empty
# #Current Number is Prime, so include it in the vector and check the successive one
getPrimeUpto_d(n, i = i+1, primes = c(primes, i), bypass = FALSE)
} else {
# #Current Number is NOT Prime, so check the successive one
getPrimeUpto_d(n, i = i+1, primes = primes, bypass = FALSE)
}
} else {
# #Return the vector when it reaches the count
return(primes)
}
}

\[\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n} \tag{7.6}\]
In the mean calculation, normally each \({x_i}\) is given equal importance or weightage of \({1/n}\). However, in some instances the mean is computed by giving each observation a weight that reflects its relative importance. A mean computed in this manner is referred to as the weighted mean, as given in equation (7.7)
\[\bar{x} = \frac{\sum_{i=1}^n{w_ix_i}}{\sum_{i=1}^n{w_i}} \tag{7.7}\]
Caution: The unit of the mean is the same as the unit of the variable, e.g. for cost_per_kg the weight ‘w’ would be ‘kg.’
aa <- 1:10
# #Mean of First 10 Numbers
mean(aa)
## [1] 5.5

aa <- 1:10
# #Mean of First 10 Numbers
ii <- mean(aa)
print(ii)
## [1] 5.5
jj <- sum(aa)/length(aa)
stopifnot(identical(ii, jj))
#
# #Mean of First 10 Prime Numbers (is neither Prime nor Integer)
mean(f_getRDS(xxPrimes)[1:10])
## [1] 12.9
#
# #Mean of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% mean()
## [1] 4.71

aa <- tibble(Purchase = 1:5, cost_per_kg = c(3, 3.4, 2.8, 2.9, 3.25),
kg = c(1200, 500, 2750, 1000, 800))
# #NOTE that unit of mean is same as unit of the variable e.g. cost_per_kg thus 'w' would be 'kg'
(ii <- sum(aa$cost_per_kg * aa$kg)/sum(aa$kg))
## [1] 2.96
jj <- with(aa, sum(cost_per_kg * kg)/sum(kg))
kk <- weighted.mean(x = aa$cost_per_kg, w = aa$kg)
stopifnot(all(identical(ii, jj), identical(ii, kk)))

\[\begin{align} \text{if n is odd, } median(x) & = x_{(n+1)/2} \\ \text{if n is even, } median(x) & = \frac{x_{(n/2)}+x_{(n/2)+1}}{2} \end{align} \tag{7.8}\]
aa <- 1:10
# #Median of First 10 Numbers
median(aa)
## [1] 5.5

aa <- 1:10
# #Median of First 10 Numbers
median(aa)
## [1] 5.5
#
# #Median of First 10 Prime Numbers (is NOT prime)
median(f_getRDS(xxPrimes)[1:10])
## [1] 12
#
# #Median of First 100 Digits of PI
f_getRDS(xxPI)[1:100, ] %>% pull(VAL) %>% median()
## [1] 4.5

\[\overline{x}_g = \left(\prod _{i=1}^{n} x_i\right)^{\frac{1}{n}} = \sqrt[{n}]{x_1 x_2 \ldots x_n} \tag{7.9}\]
aa <- 1:10
# #Geometric Mean of First 10 Numbers
exp(mean(log(aa)))
## [1] 4.528729

aa <- 1:10
# #Geometric Mean of First 10 Numbers
ii <- exp(mean(log(aa)))
jj <- prod(aa)^(1/length(aa))
stopifnot(identical(ii, jj))
#
# #Geometric Mean of First 10 Prime Numbers
exp(mean(log(f_getRDS(xxPrimes)[1:10])))
## [1] 9.573889

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ] %>% pull(VAL)
f_getMode(bb)
## [1] 9

# #Mode of First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #Get Frequency
bb %>% count(VAL)
## # A tibble: 10 x 2
## VAL n
## <int> <int>
## 1 0 8
## 2 1 8
## 3 2 12
## 4 3 12
## 5 4 10
## 6 5 8
## 7 6 9
## 8 7 8
## 9 8 12
## 10 9 13
#
# #Get Mode
bb %>% pull(VAL) %>% f_getMode()
## [1] 9

f_getMode <- function(x) {
# #Calculate Statistical Mode
# #NOTE: Single Length, All NA, Characters etc. have NOT been validated
# #https://stackoverflow.com/questions/56552709
# #https://stackoverflow.com/questions/2547402
# #Remove NA
if (anyNA(x)) {
x <- x[!is.na(x)]
}
# #Get Unique Values
ux <- unique(x)
# #Match
ux[which.max(tabulate(match(x, ux)))]
}

Use the type = 6 option of quantile() to match Excel's PERCENTILE.EXC; the default is type = 7.
\[L_p = \frac{p}{100}(n+1) \tag{7.10}\]
# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
quantile(bb$VAL, 0.5)
## 50%
## 4.5

# #First 100 Digits of PI
bb <- f_getRDS(xxPI)[1:100, ]
#
# #50% Percentile of Digits i.e. Median
ii <- quantile(bb$VAL, 0.5)
print(ii)
## 50%
## 4.5
jj <- median(bb$VAL)
stopifnot(identical(unname(ii), jj))
#
# #All Quartiles
quantile(bb$VAL, seq(0, 1, 0.25))
## 0% 25% 50% 75% 100%
## 0.00 2.00 4.50 7.25 9.00
# #summary()
summary(bb$VAL)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 2.00 4.50 4.71 7.25 9.00
#
# #To Match with Excel "PERCENTILE.EXC" use type=6 in place of default type=7
quantile(bb$VAL, seq(0, 1, 0.25), type = 6)
## 0% 25% 50% 75% 100%
## 0.00 2.00 4.50 7.75 9.00

In addition to measures of location, it is often desirable to consider measures of variability, or dispersion.
range()
max() - min()
IQR()
\[\begin{align} \sigma^2 &= \frac{1}{n} \sum _{i=1}^{n} \left(x_i - \mu \right)^2 \\ s^2 &= \frac{1}{n-1} \sum _{i=1}^{n} \left(x_i - \overline{x} \right)^2 \end{align} \tag{7.11}\]
\[\begin{align} \sigma &= \sqrt{\frac{1}{n} \sum_{i=1}^n \left(x_i - \mu\right)^2} \\ {s} &= \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(x_i - \bar{x}\right)^2} \end{align} \tag{7.12}\]
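A quick base-R check that var() and sd() implement the sample (n − 1) forms of Equations (7.11) and (7.12); the data vector is made up.

```r
# #var() and sd() use the (n - 1) denominator, i.e. the sample forms
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
stopifnot(all.equal(var(x), sum((x - mean(x))^2) / (length(x) - 1)))
stopifnot(all.equal(sd(x), sqrt(var(x))))
```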
Refer to Figure 7.1.
Skewness is given by Equation (7.13), which is shown here because it has deep meaning (and looks cool).
\[Skew = \frac{\tfrac {1}{n}\sum_{i=1}^{n}(x_{i}-{\overline {x}})^{3}}{\left[\tfrac {1}{n-1}\sum_{i=1}^{n}(x_{i}-{\overline {x}})^{2} \right]^{3/2}} \tag{7.13}\]
Figure 7.1 (Left Tail, Negative) Beta, Normal Distribution, Exponential (Positive, Right Tail)
# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
even_skew <- c(49, 50, 51)
pos_skew <- c(even_skew, 60)
neg_skew <- c(even_skew, 40)
skew_lst <- list(even_skew, pos_skew, neg_skew)
# #Mean, Median, SD
cat(paste0("Mean (even, pos, neg): ",
paste0(vapply(skew_lst, mean, numeric(1)), collapse = ", "), "\n"))
## Mean (even, pos, neg): 50, 52.5, 47.5
cat(paste0("Median (even, pos, neg): ",
paste0(vapply(skew_lst, median, numeric(1)), collapse = ", "), "\n"))
## Median (even, pos, neg): 50, 50.5, 49.5
cat(paste0("SD (even, pos, neg): ", paste0(
round(vapply(skew_lst, sd, numeric(1)), 1), collapse = ", "), "\n"))
## SD (even, pos, neg): 1, 5.1, 5.1
#
cat(paste0("Skewness (even, pos, neg): ", paste0(
round(vapply(skew_lst, e1071::skewness, numeric(1)), 1), collapse = ", "), "\n"))
## Skewness (even, pos, neg): 0, 0.7, -0.7
cat(paste0("Kurtosis (even, pos, neg): ", paste0(
round(vapply(skew_lst, e1071::kurtosis, numeric(1)), 1), collapse = ", "), "\n"))
## Kurtosis (even, pos, neg): -2.3, -1.7, -1.7
# #Skewness Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Skewness: Normal has value close to 0
# #Skewness "e1071" has Type = 3 as default. Its Type = 1 matches "moments"
# #Practically, Normal has (small) NON-Zero Positive Skewness
skew_e_t3 <- vapply(dis_lst, e1071::skewness, numeric(1))
skew_e_t2 <- vapply(dis_lst, e1071::skewness, type = 2, numeric(1))
skew_e_t1 <- vapply(dis_lst, e1071::skewness, type = 1, numeric(1))
skew_mmt <- vapply(dis_lst, moments::skewness, numeric(1))
stopifnot(identical(round(skew_e_t1, 10), round(skew_mmt, 10)))
cat(paste0("e1071: Type = 1 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Skewness (Normal, Exp, Beta): 0.0407, 2.0573, -0.6279
cat(paste0("e1071: Type = 2 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Skewness (Normal, Exp, Beta): 0.0407, 2.0576, -0.628
cat(paste0("e1071: Type = 3 Skewness (Normal, Exp, Beta): ",
paste0(round(skew_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Skewness (Normal, Exp, Beta): 0.0407, 2.057, -0.6278
#
# #Formula: (sigma_ (x_i - mu)^3) /(n * sd^3)
bb <- xxNormal
skew_man <- sum({bb - mean(bb)}^3) / {length(bb) * sd(bb)^3}
cat(paste0("(Manual) Skewness of Normal: ", round(skew_man, 4),
" (vs. e1071 Type 3 = ", round(skew_e_t3[1], 4), ") \n"))
## (Manual) Skewness of Normal: 0.0407 (vs. e1071 Type 3 = 0.0407)
set.seed(3)
nn <- 10000L
# #Normal distribution is symmetrical
xxNormal <- rnorm(n = nn, mean = 0, sd = 1)
# #The exponential distribution is positive skew
xxExp <- rexp(n = nn, rate = 1)
# #The beta distribution with hyper-parameters α=5 and β=2 is negative skew
xxBeta <- rbeta(n = nn, shape1 = 5, shape2 = 2)
#
# #Save
f_setRDS(xxNormal)
f_setRDS(xxExp)
f_setRDS(xxBeta)
#f_getRDS(xxNormal)
# #Get the Distributions
xxNormal <- f_getRDS(xxNormal)
xxExp <- f_getRDS(xxExp)
xxBeta <- f_getRDS(xxBeta)
# #Density Curve
# #Assumes 'hh' has data in 'ee'. In: caption_hh
#Basics
mean_hh <- mean(hh$ee)
sd_hh <- sd(hh$ee)
#
skew_hh <- skewness(hh$ee)
kurt_hh <- kurtosis(hh$ee)
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]], .05)
q95_hh <- quantile(hh[[1]], .95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 3 * sd_hh})
sig3l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 3 * sd_hh})
sig2r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + 2 * sd_hh}, {x < mean_hh + 3 * sd_hh})
sig2l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - 2 * sd_hh}, {x > mean_hh - 3 * sd_hh})
sig1r_hh <- density_hh_tbl %>% filter(x >= {mean_hh + sd_hh}, {x < mean_hh + 2 * sd_hh})
sig1l_hh <- density_hh_tbl %>% filter(x <= {mean_hh - sd_hh}, {x > mean_hh - 2 * sd_hh})
sig0r_hh <- density_hh_tbl %>% filter(x > mean_hh, {x < mean_hh + 1 * sd_hh})
sig0l_hh <- density_hh_tbl %>% filter(x < mean_hh, {x > mean_hh - 1 * sd_hh})
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
#
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_density(alpha = 0.2, colour = "#21908CFF") +
geom_area(data = sig3l_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = '#440154FF') +
geom_area(data = sig2l_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig2r_hh, aes(x = x, y = y), fill = '#3B528BFF') +
geom_area(data = sig1l_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig1r_hh, aes(x = x, y = y), fill = '#21908CFF') +
geom_area(data = sig0l_hh, aes(x = x, y = y), fill = '#5DC863FF') +
geom_area(data = sig0r_hh, aes(x = x, y = y), fill = '#5DC863FF') +
scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh) +
theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5),
axis.ticks = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
axis.line.y = element_blank(), axis.title.y = element_blank(), axis.text.y = element_blank()) +
labs(x = "x", y = "Density",
subtitle = paste0("Mean = ", round(mean_hh, 3), "; SD = ", round(sd_hh, 3), "; Skewness = ", round(skew_hh, 3), "; Kurtosis = ", round(kurt_hh, 3)),
caption = caption_hh, title = title_hh)
}
assign(caption_hh, C03)
rm(C03)
Distributions with zero excess kurtosis are called mesokurtic. The most prominent example of a mesokurtic distribution is the normal distribution. The kurtosis of any univariate normal distribution is 3.
Distributions with kurtosis less than 3 are said to be platykurtic. It means the distribution produces fewer and less extreme outliers than does the normal distribution. An example of a platykurtic distribution is the uniform distribution, which does not produce outliers.
Distributions with kurtosis greater than 3 are said to be leptokurtic. An example of a leptokurtic distribution is the Laplace distribution, which has tails that asymptotically approach zero more slowly than a Gaussian, and therefore produces more outliers than the normal distribution.
Kurtosis is the average (or expected value) of the standardized data raised to the fourth power. Any standardized values that are less than 1 (i.e., data within one standard deviation of the mean, where the “peak” would be), contribute virtually nothing to kurtosis, since raising a number that is less than 1 to the fourth power makes it closer to zero. The only data values that contribute to kurtosis in any meaningful way are those outside the region of the peak; i.e., the outliers. Therefore, kurtosis measures outliers only; it measures nothing about the “peak.”
The sample kurtosis is a useful measure of whether there is a problem with outliers in a data set. Larger kurtosis indicates a more serious outlier problem.
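The three regimes above can be sketched with base R alone. The manual formula below is the (biased) moment estimator of excess kurtosis, so the values are only approximate, and a Laplace sample is simulated as the difference of two independent exponentials.

```r
set.seed(7)
nn <- 100000L
# #Excess kurtosis: mean of standardized values raised to the 4th power, minus 3
f_kurt <- function(x) mean(((x - mean(x)) / sd(x))^4) - 3
k_unif <- f_kurt(runif(nn))               # #Uniform: ~ -1.2 (platykurtic)
k_norm <- f_kurt(rnorm(nn))               # #Normal:  ~  0   (mesokurtic)
k_lap  <- f_kurt(rexp(nn) - rexp(nn))     # #Laplace: ~  3   (leptokurtic)
stopifnot(k_unif < k_norm, k_norm < k_lap)
```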
# #Kurtosis Calculation: Package "e1071" (Package "moments" deprecated)
dis_lst <- list(xxNormal, xxExp, xxBeta)
#
# #Kurtosis: Normal has Kurtosis close to 3 (= 0 excess Kurtosis)
# #Kurtosis "e1071" has Type = 3 as default. Its Type = 1 matches "moments" with difference of 3
kurt_e_t3 <- vapply(dis_lst, e1071::kurtosis, numeric(1))
kurt_e_t2 <- vapply(dis_lst, e1071::kurtosis, type = 2, numeric(1))
kurt_e_t1 <- vapply(dis_lst, e1071::kurtosis, type = 1, numeric(1))
kurt_mmt <- vapply(dis_lst, moments::kurtosis, numeric(1))
stopifnot(identical(round(kurt_e_t1, 10), round(kurt_mmt - 3, 10)))
cat(paste0("e1071: Type = 1 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t1, 4), collapse = ", "), "\n"))
## e1071: Type = 1 Kurtosis (Normal, Exp, Beta): -0.0687, 6.3223, -0.106
cat(paste0("e1071: Type = 2 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t2, 4), collapse = ", "), "\n"))
## e1071: Type = 2 Kurtosis (Normal, Exp, Beta): -0.0682, 6.326, -0.1055
cat(paste0("e1071: Type = 3 Kurtosis (Normal, Exp, Beta): ",
paste0(round(kurt_e_t3, 4), collapse = ", "), "\n"))
## e1071: Type = 3 Kurtosis (Normal, Exp, Beta): -0.0693, 6.3204, -0.1066
#
# #Formula: (sigma_ (x_i - mu)^4) /(n * sd^4)
bb <- xxNormal
kurt_man <- {sum({bb - mean(bb)}^4) / {length(bb) * sd(bb)^4}} - 3
cat(paste0("(Manual) Kurtosis of Normal: ", round(kurt_man, 4),
" (vs. e1071 Type 3 = ", round(kurt_e_t3[1], 4), ") \n"))
## (Manual) Kurtosis of Normal: -0.0693 (vs. e1071 Type 3 = -0.0693)
Measures of relative location help us determine how far a particular value is from the mean. By using both the mean and standard deviation, we can determine the relative location of any observation.
\[z_i = \frac{x_i - \overline{x}}{s} \tag{7.14}\]
NOTE: The “z statistic” is a special case of the standard score (“critical z”): in it, \(\sigma/\sqrt{n}\) plays the role of the standard deviation, because per the CLT it is the standard error, i.e. the standard deviation, of the sample mean, rather than a known population or sample standard deviation. The standard score is the general case: it can always be computed whenever a mean and standard deviation are available, and it translates x into a z variable with zero mean and unit variance (it “imposes normality” even when the data may not be normal!). The “z statistic” standardizes in exactly the same way, for the special case where the quantity being standardized is the sample mean.
Caution:
xxflights <- f_getRDS(xxflights)
bb <- na.omit(xxflights$air_time)
# Scaling
ii <- {bb - mean(bb)} / sd(bb)
str(ii)
## num [1:327346] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
## - attr(*, "na.action")= 'omit' int [1:9430] 472 478 616 644 726 734 755 839 840 841 ...
# #scale() gives a Matrix with original mean and sd as its attribute
jj <- scale(bb)
str(jj)
## num [1:327346, 1] 0.8145 0.8145 0.0994 0.3449 -0.3702 ...
## - attr(*, "scaled:center")= num 151
## - attr(*, "scaled:scale")= num 93.7
stopifnot(identical(as.vector(ii), as.vector(jj)))
#
hh <- tibble(ee = as.vector(jj))
title_hh <- "Flights: Air Time (Scaled)"
caption_hh <- "C03P08" #xxxx
Figure 7.2 Before and After Scaling
# #hh$ee title_hh caption_hh
#
C03 <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) +
geom_histogram(bins = 50, alpha = 0.4, fill = '#FDE725FF') +
geom_vline(aes(xintercept = mean(.data[["ee"]])), color = '#440154FF') +
annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE) +
coord_cartesian(ylim = c(0, 35000)) +
theme(plot.title.position = "panel") +
labs(x = "x", y = "Frequency",
subtitle = paste0("(Mean= ", round(mean(.[[1]]), 3),
"; SD= ", round(sd(.[[1]]), 3),
")"),
caption = caption_hh, title = title_hh)
}
assign(caption_hh, C03)
C03
rm(C03)
if(FALSE){
# #If TeX() is present, check_overlap = TRUE is NOT being respected, slow plot draw for n >10000
geom_text(aes(label = TeX(r'($\bar{x}$)', output = "character"),
x = mean(.data[["ee"]]), y = -Inf),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE, check_overlap = TRUE)
# #Create your own dataset
geom_text(data = tibble(x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character")),
aes(x = x, y = y, label = label),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE )
# #Or Equivalent
ggplot2::annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE)
#
ggpp::annotate(geom = "text", x = mean(.[[1]]), y = -Inf,
label = TeX(r'($\bar{x}$)', output = "character"),
color = '#440154FF', hjust = -2, vjust = -2.5, parse = TRUE)
}
# #List All Colour Names in R
str(colors())
## chr [1:657] "white" "aliceblue" "antiquewhite" "antiquewhite1" "antiquewhite2" "antiquewhite3" ...
# #Packages: viridis, scales, viridisLite
# #Show N Colours with Max. Contrast
q_colors <- 5
# #Display Colours
if(FALSE) show_col(viridis_pal()(q_colors))
# #Get the Viridis i.e. "D" palette Hex Values for N Colours
v_colors <- viridis(q_colors, option = "D")
v_colors
## [1] "#440154FF" "#3B528BFF" "#21908CFF" "#5DC863FF" "#FDE725FF"
The Five-Number Summary (Min, Q1, Median, Q3, Max) is used to quickly summarise a dataset.
Figure 7.3 geom_boxplot()
# #nycflights13::weather
bb <- weather
# #NA are present in the data
summary(bb$temp)
#
# #BoxPlot
C03P01 <- bb %>% drop_na(temp) %>% mutate(month = factor(month, ordered = TRUE)) %>% {
ggplot(data = ., mapping = aes(x = month, y = temp)) +
#geom_violin() +
geom_boxplot(aes(fill = month), outlier.colour = 'red', notch = TRUE) +
stat_summary(fun = mean, geom = "point", size = 2, color = "steelblue") +
scale_y_continuous(breaks = seq(0, 110, 10), limits = c(0, 110)) +
#geom_point() +
#geom_jitter(position=position_jitter(0.2)) +
k_gglayer_box +
theme(legend.position = 'none') +
labs(x = "Months", y = "Temperature", subtitle = "With Mean & Notch",
caption = "C03P01", title = "BoxPlot")
}
\[\begin{align} \sigma_{xy} &= \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{n} \\ s_{xy} &= \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{n-1} \end{align} \tag{7.15}\]
Figure 7.4 Scatter Plot Quadrants for Covariance
# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>%
mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY)
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
bb <- f_getRDS(xxCommercials)
# #Define the formula for Trendline calculation
k_gg_formula <- y ~ x
#
# #Scatterplot, Trendline Equation, R2, mean x & y
C03P02 <- bb %>% {
ggplot(data = ., aes(x = Commercials, y = Sales)) +
geom_smooth(method = 'lm', formula = k_gg_formula, se = FALSE) +
stat_poly_eq(aes(label = paste0("atop(", ..eq.label.., ", \n", ..rr.label.., ")")),
formula = k_gg_formula, eq.with.lhs = "italic(hat(y))~`=`~",
eq.x.rhs = "~italic(x)", parse = TRUE) +
geom_vline(aes(xintercept = round(mean(Commercials), 3)), color = 'red', linetype = "dashed") +
geom_hline(aes(yintercept = round(mean(Sales), 3)), color = 'red', linetype = "dashed") +
geom_text(aes(label = TeX(r"($\bar{x} = 3$)", output = "character"),
x = round(mean(Commercials), 3), y = -Inf),
color = 'red', hjust = -0.2, vjust = -0.5, parse = TRUE, check_overlap = TRUE) +
geom_text(aes(label = TeX(r"($\bar{y} = 51$)", output = "character"),
x = Inf, y = round(mean(Sales), 3)),
color = 'red', hjust = 1.5, vjust = -0.5, parse = TRUE, check_overlap = TRUE) +
geom_point() +
k_gglayer_scatter +
labs(x = "Commercials", y = "Sales ($100s)",
subtitle = TeX(r"(Trendline Equation, $R^{2}$, $\bar{x}$ and $\bar{y}$)"),
caption = "C03P02", title = "Scatter Plot")
}
\[\begin{align} \rho_{xy} &= \frac{\sigma_{xy}}{\sigma_{x}\sigma_{y}} \\ r_{xy} &= \frac{s_{xy}}{s_{x}s_{y}} \end{align} \tag{7.16}\]
# #Get 'Deviation about the mean' i.e. devX and devY and their Product devXY
ii <- bb %>%
mutate(devX = Commercials - mean(Commercials), devY = Sales - mean(Sales), devXY = devX * devY)
#
# #Sample Covariance
sxy <- sum(ii$devXY) / {length(ii$devXY) -1}
print(sxy)
## [1] 11
jj <- ii %>% mutate(devXsq = devX * devX, devYsq = devY * devY)
# #Sample Covariance Sx, Sample Standard Deviations Sx Sy
sxy <- sum(ii$devXY) / {nrow(ii) -1}
sx <- round(sqrt(sum(jj$devXsq) / {nrow(jj) -1}), 2)
sy <- round(sqrt(sum(jj$devYsq) / {nrow(jj) -1}), 2)
cat(paste0("Sxy =", sxy, ", Sx =", sx, ", Sy =", sy, "\n"))
## Sxy =11, Sx =1.49, Sy =7.93
#
# #Correlation Coefficient Rxy
rxy <- round(sxy / {sx * sy}, 2)
cat(paste0("Correlation Coefficient Rxy =", rxy, "\n"))
## Correlation Coefficient Rxy =0.93
| Week | Commercials | Sales | devX | devY | devXY | devXsq | devYsq |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 50 | -1 | -1 | 1 | 1 | 1 |
| 2 | 5 | 57 | 2 | 6 | 12 | 4 | 36 |
| 3 | 1 | 41 | -2 | -10 | 20 | 4 | 100 |
| 4 | 3 | 54 | 0 | 3 | 0 | 0 | 9 |
| 5 | 4 | 54 | 1 | 3 | 3 | 1 | 9 |
| 6 | 1 | 38 | -2 | -13 | 26 | 4 | 169 |
| 7 | 5 | 63 | 2 | 12 | 24 | 4 | 144 |
| 8 | 3 | 48 | 0 | -3 | 0 | 0 | 9 |
| 9 | 4 | 59 | 1 | 8 | 8 | 1 | 64 |
| 10 | 2 | 46 | -1 | -5 | 5 | 1 | 25 |
\[\begin{align} n! &= \prod _{i=1}^n i = n \cdot (n-1) \\ &= n \cdot(n-1)\cdot(n-2)\cdot(n-3)\cdot\cdots \cdot 3 \cdot 2 \cdot 1 \end{align} \tag{8.1}\]
\[C_k^N = \binom{N}{k} = \frac{N!}{k!(N-k)!} \tag{8.2}\]
\[P_k^N = k! \binom{N}{k} = \frac{N!}{(N-k)!} \tag{8.3}\]
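Equations (8.1)–(8.3) map directly onto base R's factorial() and choose(); for example, with N = 5 and k = 2:

```r
# #Eq. (8.1): factorial as the descending product
stopifnot(factorial(5) == 5 * 4 * 3 * 2 * 1)
# #Eq. (8.2): combinations via choose()
stopifnot(choose(5, 2) == factorial(5) / (factorial(2) * factorial(3)))
# #Eq. (8.3): permutations = k! times combinations
stopifnot(factorial(2) * choose(5, 2) == factorial(5) / factorial(3))
```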
\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{8.4}\]
\[\begin{align} P(A \cap B) &= P(B) \cdot P(A | B) \\ &= P(A) \cdot P(B | A) \end{align} \tag{8.5}\]
Often, we begin the analysis with initial or prior probability estimates for specific events of interest. Then, from sources such as a sample, a special report, or a product test, we obtain additional information about the events. Given this new information, we update the prior probability values by calculating revised probabilities, referred to as posterior probabilities. Bayes theorem provides a means for making these probability calculations.
\[\begin{align} P(A_{1}|B) &= \frac{P(A_{1})P(B|A_{1})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \\ P(A_{2}|B) &= \frac{P(A_{2})P(B|A_{2})}{P(A_{1}) P(B|A_{1})+ P(A_{2}) P(B|A_{2})} \end{align} \tag{8.6}\]
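A minimal sketch of Equation (8.6) in base R. The prior and conditional probabilities here are made up purely for illustration; the posterior is the joint probabilities normalised by the total probability of B.

```r
# #Hypothetical priors P(A1), P(A2) and likelihoods P(B | A1), P(B | A2)
prior <- c(A1 = 0.65, A2 = 0.35)
lik   <- c(A1 = 0.02, A2 = 0.05)
# #Posterior via Bayes theorem: joint probabilities divided by P(B)
post <- prior * lik / sum(prior * lik)
stopifnot(all.equal(sum(post), 1))
round(post, 4)
```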
| \({x}\) | \(f(x)\) | \(\sum xf(x)\) | \((x - \mu)\) | \((x - \mu)^2\) | \(\sum {(x - \mu)^{2}f(x)}\) |
|---|---|---|---|---|---|
| 0 | 0.18 | 0 | -1.5 | 2.25 | 0.405 |
| 1 | 0.39 | 0.39 | -0.5 | 0.25 | 0.0975 |
| 2 | 0.24 | 0.48 | 0.5 | 0.25 | 0.06 |
| 3 | 0.14 | 0.42 | 1.5 | 2.25 | 0.315 |
| 4 | 0.04 | 0.16 | 2.5 | 6.25 | 0.25 |
| 5 | 0.01 | 0.05 | 3.5 | 12.25 | 0.1225 |
| Total | 1.00 | mu = 1.5 | NA | NA | sigma^2 = 1.25 |
# #Dicarlo: Days with Number of Cars Sold per day for last 300 days
xxdicarlo <- tibble(Cars = 0:5, Days = c(54, 117, 72, 42, 12, 3))
#
bb <- xxdicarlo
bb <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
R_dicarlo_var_y_C05 <- sum(bb$x_mu_sq_Fx)
# #Total Row
bb <- bb %>%
mutate(across(1, as.character)) %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>%
mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx)) %>%
mutate(across(4:5, ~ replace(., x == "Total", NA)))
# #Change Column Classes as required
bb %>% mutate(across(1, as.character))
bb %>% mutate(across(everything(), as.character))
bb <- xxdicarlo
ii <- bb %>% rename(x = Cars, Fx = Days) %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
# #Add Total Row
ii <- ii %>%
mutate(across(1, as.character)) %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum)))
#
# #Modify Specific Row Values without using filter()
# #filter() does not have an 'un-filter()' function like the group_by()-ungroup() combination
# #Selecting Row where x = "Total" and changing Column Values for Two Columns
ii <- ii %>%
mutate(xFx = ifelse(x == "Total", paste0("mu = ", xFx), xFx),
x_mu_sq_Fx = ifelse(x == "Total", paste0("sigma^2 = ", x_mu_sq_Fx), x_mu_sq_Fx))
#
# #Selecting Row where x = "Total" and doing same replacement on Two Columns
ii %>% mutate(across(4:5, function(y) replace(y, x == "Total", NA)))
ii %>% mutate(across(4:5, ~ replace(., x == "Total", NA)))
bb <- xxdicarlo_gs
# #Assuming there is NO Total Column NOR Total Row and First Column is character
kk <- bb %>% summarise(across(where(is.numeric), sum)) %>% summarise(sum(.)) %>% pull(.)
ll <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
stopifnot(identical(kk, ll))
print(kk)
## [1] 300
bb <- xxdicarlo_gs
# #Round off values to 1 significant digits i.e. 0.003 or 0.02
# #NOTE: This changes the column to "character"
bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>%
mutate(across(where(is.numeric), format, digits =1))
## # A tibble: 4 x 7
## Geneva_Saratoga y0 y1 y2 y3 y4 y5
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 x0 0.07 0.10 0.08 0.03 0.007 0.000
## 2 x1 0.07 0.12 0.11 0.06 0.007 0.003
## 3 x2 0.03 0.14 0.03 0.04 0.010 0.007
## 4 x3 0.01 0.03 0.02 0.01 0.017 0.000
| \(ID\) | \({s}\) | \(f(s)\) | \(\sum sf(s)\) | \((s - E(s))\) | \((s - E(s))^2\) | \(\sum {(s - E(s))^{2}f(s)}\) |
|---|---|---|---|---|---|---|
| A | 0 | 0.070 | 0.00 | -2.64 | 6.99 | 0.489 |
| B | 1 | 0.170 | 0.17 | -1.64 | 2.70 | 0.459 |
| C | 2 | 0.230 | 0.46 | -0.64 | 0.41 | 0.095 |
| D | 3 | 0.290 | 0.87 | 0.36 | 0.13 | 0.037 |
| E | 4 | 0.127 | 0.51 | 1.36 | 1.84 | 0.233 |
| F | 5 | 0.067 | 0.33 | 2.36 | 5.55 | 0.370 |
| G | 6 | 0.023 | 0.14 | 3.36 | 11.27 | 0.263 |
| H | 7 | 0.023 | 0.16 | 4.36 | 18.98 | 0.443 |
| I | 8 | 0.000 | 0.00 | 5.36 | 28.69 | 0.000 |
| Total | NA | 1.000 | E(s) = 2.64 | NA | NA | Var(s) = 2.389 |
bb <- xxdicarlo_gs
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
# #Convert to Bivariate Probability Distribution
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>%
relocate(Fs, .after = last_col()) %>%
mutate(sFs = s * Fs, s_Es = s - sum(sFs),
s_Es_sq = s_Es * s_Es, s_Es_sq_Fs = s_Es_sq * Fs)
# #Save for Notebook
R_dicarlo_var_s_C05 <- sum(kk$s_Es_sq_Fs)
# #For Printing
ll <- kk %>%
add_row(summarise(., across(1, ~"Total")), summarise(., across(where(is.double), sum))) %>%
mutate(across(where(is.numeric), format, digits =2)) %>%
mutate(sFs = ifelse(ID == "Total", paste0("E(s) = ", sFs), sFs),
s_Es_sq_Fs = ifelse(ID == "Total", paste0("Var(s) = ", s_Es_sq_Fs), s_Es_sq_Fs)) %>%
mutate(across(c(2,5,6), ~ replace(., ID == "Total", NA)))
bb <- xxdicarlo_gs
# #From the Bivariate get the original data
ii <- bb %>%
mutate(Fx = rowSums(across(where(is.numeric)))) %>%
select(1, 8) %>%
separate(col = Geneva_Saratoga, into = c(NA, "x"), sep = 1) %>%
mutate(across(1, as.integer))
# #Variance Calculation
jj <- ii %>% mutate(across(Fx, ~./sum(Fx))) %>%
mutate(xFx = x * Fx, x_mu = x - sum(xFx),
x_mu_sq = x_mu * x_mu, x_mu_sq_Fx = x_mu_sq * Fx)
# #Save for Notebook
R_dicarlo_var_x_C05 <- sum(jj$x_mu_sq_Fx)
print(jj)
## # A tibble: 4 x 6
## x Fx xFx x_mu x_mu_sq x_mu_sq_Fx
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.287 0 -1.14 1.31 0.375
## 2 1 0.37 0.37 -0.143 0.0205 0.00760
## 3 2 0.257 0.513 0.857 0.734 0.188
## 4 3 0.0867 0.26 1.86 3.45 0.299
bb <- xxdicarlo_gs
#
# #Tibble Total SUM
sum_bb <- bb %>% summarise(across(-1, sum)) %>% summarise(sum(.)) %>% pull(.)
#
# #Convert to Bivariate Probability Distribution and Exclude First Character Column
ii <- bb %>% mutate(across(where(is.numeric), ~./sum_bb)) %>% select(-1)
#
# #(1A, 2B, 3C, 4D, 4E, 4F, 3G, 2H, 1I) 9 Unique Combinations = 24 (4x6) Experimental Outcomes
matrix(data = LETTERS[row(ii) + col(ii)-1], nrow = 4)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] "A" "B" "C" "D" "E" "F"
## [2,] "B" "C" "D" "E" "F" "G"
## [3,] "C" "D" "E" "F" "G" "H"
## [4,] "D" "E" "F" "G" "H" "I"
#
# #Using tapply(), sum the Matrix
jj <- tapply(X= as.matrix(ii), INDEX = LETTERS[row(ii) + col(ii)-1], FUN = sum)
print(jj)
## A B C D E F G H I
## 0.07000000 0.17000000 0.23000000 0.29000000 0.12666667 0.06666667 0.02333333 0.02333333 0.00000000
# #In place of LETTERS, Numerical Index can also be used but Letters are more clear for grouping
#tapply(X= as.matrix(ii), INDEX = c(0:8)[row(ii) + col(ii)-1], FUN = sum)
#
# #Create Tibble
kk <- tibble(Fs = jj, ID = LETTERS[1:length(Fs)], s = 1:length(Fs) - 1) %>%
relocate(Fs, .after = last_col())
print(kk)
## # A tibble: 9 x 3
## ID s Fs
## <chr> <dbl> <dbl>
## 1 A 0 0.07
## 2 B 1 0.17
## 3 C 2 0.23
## 4 D 3 0.29
## 5 E 4 0.127
## 6 F 5 0.0667
## 7 G 6 0.0233
## 8 H 7 0.0233
## 9 I 8 0
bb <- xxdicarlo_gs
# #Separate String based on Position
bb %>% separate(col = Geneva_Saratoga, into = c("A", "B"), sep = 1)
## # A tibble: 4 x 8
## A B y0 y1 y2 y3 y4 y5
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x 0 21 30 24 9 2 0
## 2 x 1 21 36 33 18 2 1
## 3 x 2 9 42 9 12 3 2
## 4 x 3 3 9 6 3 5 0
\[\sigma_{xy} = \frac{\text{Var}(x+y) - \text{Var}(x) - \text{Var}(y)}{2} \tag{9.1}\]
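Equation (9.1) holds exactly for the sample statistics too, since Var(x + y) = Var(x) + Var(y) + 2 Cov(x, y); a quick numeric check on simulated data:

```r
set.seed(2)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
# #Recover the covariance from variances alone, per Eq. (9.1)
lhs <- (var(x + y) - var(x) - var(y)) / 2
stopifnot(all.equal(lhs, cov(x, y)))
```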
dbinom(), pbinom(), qbinom(), rbinom()
dpois(), ppois(), qpois(), rpois()
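The four prefixes are mutually consistent: the d- density sums (over the support) to the p- cumulative probability, and q- inverts p-. For example, for the binomial with size = 10, prob = 0.5, and the Poisson with lambda = 2:

```r
# #CDF = sum of the PMF over 0..k
stopifnot(all.equal(pbinom(3, size = 10, prob = 0.5),
                    sum(dbinom(0:3, size = 10, prob = 0.5))))
# #qbinom() inverts pbinom()
stopifnot(qbinom(pbinom(3, 10, 0.5), size = 10, prob = 0.5) == 3)
# #Same pattern for the Poisson
stopifnot(all.equal(ppois(4, lambda = 2), sum(dpois(0:4, lambda = 2))))
```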
\[\begin{align} E(x) &= \frac{a+b}{2} \\ \text{Var}(x) &= \frac{(b-a)^2}{12} \end{align} \tag{10.1}\]
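Equation (10.1) can be checked by simulation; with a = 2 and b = 10, E(x) = 6 and Var(x) = 64/12 ≈ 5.33:

```r
set.seed(1)
x <- runif(1000000L, min = 2, max = 10)
# #Simulated moments should land near the Eq. (10.1) values
stopifnot(abs(mean(x) - (2 + 10) / 2) < 0.05)
stopifnot(abs(var(x) - (10 - 2)^2 / 12) < 0.1)
```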
\[f(x)={\frac {1}{\sigma {\sqrt {2 \pi}}}} e^{-{\frac {1}{2}}\left( {\frac {x-\mu }{\sigma }}\right) ^{2}} \tag{10.2}\]
Figure 10.1 Normal Distribution
# #Histogram with Density Curve, Mean and Median: Normal Distribution
ee <- f_getRDS(xxNormal)
hh <- tibble(ee)
ee <- NULL
# #Basics
median_hh <- round(median(hh[[1]]), 3)
mean_hh <- round(mean(hh[[1]]), 3)
sd_hh <- round(sd(hh[[1]]), 3)
len_hh <- nrow(hh)
#
# #Base Plot: Creates Only Density Function Line
ii <- hh %>% { ggplot(data = ., mapping = aes(x = ee)) + geom_density() }
#
# #Change the line colour and alpha
ii <- ii + geom_density(alpha = 0.2, colour = "#21908CFF")
#
# #Add Histogram with 50 bins, alpha and fill
ii <- ii + geom_histogram(aes(y = ..density..), bins = 50, alpha = 0.4, fill = '#FDE725FF')
#
# #Full Vertical Line at Mean. Goes across Function Boundary on Y-Axis
#ii <- ii + geom_vline(aes(xintercept = mean_hh), color = '#440154FF')
#
# #Shaded Area Object for line /Area up to the Function Boundary on Y-Axis
# #Mean
ii_mean <- ggplot_build(ii)$data[[1]] %>% filter(x <= mean_hh)
# #Median
ii_median <- ggplot_build(ii)$data[[1]] %>% filter(x <= median_hh)
#
# #To show values which are less than Mean in colour
#ii <- ii + geom_area(data = ii_mean, aes(x = x, y = y), fill = 'blue', alpha = 0.5)
#
# #Line upto the Density Curve at Mean
ii <- ii + geom_segment(data = ii_mean,
aes(x = mean_hh, y = 0, xend = mean_hh, yend = density), color = "#440154FF")
#
# #Label 'Mean'
ii <- ii + geom_text(aes(label = paste0("Mean= ", mean_hh), x = mean_hh, y = -Inf),
color = '#440154FF', hjust = -0.5, vjust = -1, angle = 90, check_overlap = TRUE)
#
# #Similarly, Median Line and Label
ii <- ii + geom_segment(data = ii_median,
aes(x = median_hh, y = 0, xend = median_hh, yend = density), color = "#3B528BFF") +
geom_text(aes(label = paste0("Median= ", median_hh), x = median_hh, y = -Inf),
color = '#3B528BFF', hjust = -0.4, vjust = 1.2, angle = 90, check_overlap = TRUE)
#
# #Change Axis Limits
ii <- ii + coord_cartesian(xlim = c(-5, 5), ylim = c(0, 0.5))
#
# #Change x-Axis Ticks interval
xbreaks_hh <- seq(-3, 3)
xpoints_hh <- mean_hh + xbreaks_hh * sd_hh
# # Latex Labels
xlabels_hh <- c(TeX(r'($\,\,\mu - 3 \sigma$)'), TeX(r'($\,\,\mu - 2 \sigma$)'),
TeX(r'($\,\,\mu - 1 \sigma$)'), TeX(r'($\mu$)'), TeX(r'($\,\,\mu + 1 \sigma$)'),
TeX(r'($\,\,\mu + 2 \sigma$)'), TeX(r'($\,\,\mu + 3\sigma$)'))
#
ii <- ii + scale_x_continuous(breaks = xpoints_hh, labels = xlabels_hh)
#
# #Get Quantiles and Ranges of mean +/- sigma
q05_hh <- quantile(hh[[1]],.05)
q95_hh <- quantile(hh[[1]],.95)
density_hh <- density(hh[[1]])
density_hh_tbl <- tibble(x = density_hh$x, y = density_hh$y)
sig3l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 3 * sd_hh)
sig3r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 3 * sd_hh)
sig2r_hh <- density_hh_tbl %>% filter(x >= mean_hh + 2 * sd_hh, x < mean_hh + 3 * sd_hh)
sig2l_hh <- density_hh_tbl %>% filter(x <= mean_hh - 2 * sd_hh, x > mean_hh - 3 * sd_hh)
sig1r_hh <- density_hh_tbl %>% filter(x >= mean_hh + sd_hh, x < mean_hh + 2 * sd_hh)
sig1l_hh <- density_hh_tbl %>% filter(x <= mean_hh - sd_hh, x > mean_hh - 2 * sd_hh)
#
# #Use (mean +/- 3 sigma) To Highlight. NOT ALL Zones have been highlighted
ii <- ii + geom_area(data = sig3l_hh, aes(x = x, y = y), fill = 'red') +
geom_area(data = sig3r_hh, aes(x = x, y = y), fill = 'red')
#
# #Annotate Arrows
ii <- ii +
# ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[3], y = 0.42,
# yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[2], y = 0.45,
# yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] -0.5 , xend = xpoints_hh[1], y = 0.48,
yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[5], y = 0.42,
# yend = 0.42, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
# ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[6], y = 0.45,
# yend = 0.45, arrow = arrow(type = "closed", length = unit(0.02, "npc"))) +
ggplot2::annotate("segment", x = xpoints_hh[4] +0.5 , xend = xpoints_hh[7], y = 0.48,
yend = 0.48, arrow = arrow(type = "closed", length = unit(0.02, "npc")))
#
# #Annotate Labels
ii <- ii +
# ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.42, label = "68.3%") +
# ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.45, label = "95.4%") +
ggplot2::annotate(geom = "text", x = xpoints_hh[4], y = 0.48, label = "99.7%")
#
# #Add a Theme and adjust Position of Title & Subtile (Both by plot.title.position) & Caption
# #"plot" or "panel"
ii <- ii + theme(#plot.tag.position = "topleft",
#plot.caption.position = "plot",
#plot.caption = element_text(hjust = 0),
plot.title.position = "panel")
#
# #Title, Subtitle, Caption, Axis Labels, Tag
ii <- ii + labs(x = "x", y = "Density",
subtitle = paste0("(N=", len_hh, "; ", "Mean= ", mean_hh,
"; Median= ", median_hh, "; SD= ", sd_hh),
caption = "C06AA", tag = NULL,
title = "Normal Distribution (Symmetrical)")
#
#ii
# #Syntax
#latex2exp::TeX(r'($\sigma =10$)', output = "character")
# #Test Equation
plot(TeX(r'(abc: $\frac{2hc^2}{\lambda^5} \, \frac{1}{e^{\frac{hc}{\lambda k_B T}} - 1}$)'), cex=2)
plot(TeX(r'(xyz: $f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \,\left(\frac{x - \mu}{\sigma}\right)^2} $)'), cex=2)
# #Syntax
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($\sigma =10$)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #NOTE: Complex Equations like the Normal Distribution are crashing R.
ggpp::annotate("text", x = -2, y = 0.3, label=TeX(r'($f(x) =\frac{1}{\sigma \sqrt{2\pi}}\, e^{- \, \frac{1}{2} \, \left(\frac{x - \mu}{\sigma}\right)^2} $)', output = "character"), parse = TRUE, check_overlap = TRUE)
# #Data
bb <- f_getRDS(xxNormal)
hh <- tibble(bb)
# #Base Plot
ii <- hh %>% { ggplot(data = ., mapping = aes(x = bb)) + geom_density() }
# #Attributes
attributes(ggplot_build(ii))$names
## [1] "data" "layout" "plot"
#
str(ggplot_build(ii)$data[[1]])
## 'data.frame': 512 obs. of 18 variables:
## $ y : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ x : num -3.63 -3.61 -3.6 -3.58 -3.57 ...
## $ density : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ scaled : num 0.00126 0.0013 0.00133 0.00136 0.00137 ...
## $ ndensity : num 0.00126 0.0013 0.00133 0.00136 0.00137 ...
## $ count : num 5.04 5.2 5.32 5.41 5.45 ...
## $ n : int 10000 10000 10000 10000 10000 10000 10000 10000 10000 10000 ...
## $ flipped_aes: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ PANEL : Factor w/ 1 level "1": 1 1 1 1 1 1 1 1 1 1 ...
## $ group : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax : num 0.000504 0.00052 0.000532 0.000541 0.000545 ...
## $ fill : logi NA NA NA NA NA NA ...
## $ weight : num 1 1 1 1 1 1 1 1 1 1 ...
## $ colour : chr "black" "black" "black" "black" ...
## $ alpha : logi NA NA NA NA NA NA ...
## $ size : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype : num 1 1 1 1 1 1 1 1 1 1 ...
"...[[1]] is discouraged. Use .data[[1]] instead."
STOP! STOP! Just STOP! using UNICODE for the R Console on WINDOWS (UTF-8 Issue).
\[f(z) = \varphi(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}} \tag{10.3}\]
7.17 The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
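A minimal sketch of the z-score calculation, using made-up data; `scale()` returns the same standardized values:

```r
# #z-score: number of standard deviations each x_i is from the mean (made-up data)
x_ii <- c(4, 8, 6, 5, 3, 4)
z_ii <- (x_ii - mean(x_ii)) / sd(x_ii)
# #scale() standardizes the same way (subtract mean, divide by sd)
stopifnot(all.equal(as.vector(scale(x_ii)), z_ii))
round(z_ii, 4)
## [1] -0.5590  1.6771  0.5590  0.0000 -1.1180 -0.5590
```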
# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
pnorm(q = 1.00)
## [1] 0.8413447
# #Find Cumulative Probability P corresponding to the given 'z' value
# #Area under the curve to the left of z-value = 1.00
# #pnorm(q = 1.00) #default 'lower.tail = TRUE'
z_ii <- 1.00
p_ii <- round(pnorm(q = z_ii, lower.tail = TRUE), 4)
cat(paste0("P(z <= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z <= 1.000) = 0.8413
#
# #Probability that z is in the interval between -0.50 and 1.25 #0.5858 (0.5859 by z-table)
z_min_ii <- -0.50
z_max_ii <- 1.25
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-0.500 <= z <= 1.250) = 0.5858
#
# #Probability of obtaining a z value of at least 1.58 #0.0571
z_ii <- 1.58
p_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
cat(paste0("P(z >= ", format(z_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(z >= 1.580) = 0.0571
#
# #Probability that the z is within one standard deviation of the mean i.e. [-1, 1] #0.6826
z_min_ii <- -1.00
z_max_ii <- 1.00
p_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", p_ii, "\n"))
## P(-1.000 <= z <= 1.000) = 0.6827
# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10
qnorm(p = 1 - 0.10)
## [1] 1.281552
qnorm(p = 0.10, lower.tail = FALSE)
## [1] 1.281552
# #Find a z value such that the probability of obtaining a larger z value is .10
# #z-value for which Area under the curve towards Right is 0.10 i.e. right >10%
#qnorm(p = 1 - 0.10)
#qnorm(p = 0.10, lower.tail = FALSE)
p_r_ii <- 0.10
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
z_jj <- round(qnorm(p = p_r_ii, lower.tail = FALSE), 4)
stopifnot(identical(z_ii, z_jj))
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.900 (i.e. (Right) 1-P(z) = 0.100) at z = 1.2816
\[z = \frac{x- \mu}{\sigma} \tag{10.4}\]
Reasons to convert normal distributions into the standard normal distribution:
Each z-score is associated with a probability, or p-value, that gives the likelihood of values below that z-score occurring. By converting an individual value into a z-score, we can find the probability of all values up to that value occurring in a normal distribution.
The z-score is the test statistic used in a z-test. The z-test is used to compare the means of two groups, or to compare the mean of a group to a set value. Its null hypothesis typically assumes no difference between groups.
The area under the curve to the right of a z-score is the p-value, and it’s the likelihood of your observation occurring if the null hypothesis is true.
Usually, a p-value of 0.05 or less means that your results are unlikely to have arisen by chance; it indicates a statistically significant effect.
# #For N(mu =10, sd =2) Probability that X is in [10, 14]
# #Same as P(0 <= z <= 2)
mu_ii <- 10
sd_ii <- 2
x_min_ii <- 10
x_max_ii <- 14
#
z_min_ii <- (x_min_ii - mu_ii) /sd_ii #0
z_max_ii <- (x_max_ii - mu_ii) /sd_ii #2
#
pz_ii <- round(pnorm(q = z_max_ii, lower.tail = TRUE) - pnorm(q = z_min_ii, lower.tail = TRUE), 4)
# #OR
px_ii <- round(pnorm(q = x_max_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE) -
pnorm(q = x_min_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(identical(pz_ii, px_ii))
cat(paste0("P(", format(z_min_ii, nsmall = 3), " <= z <= ",
format(z_max_ii, nsmall = 3), ") = ", pz_ii, "\n"))
## P(0.000 <= z <= 2.000) = 0.4772
cat(paste0("P(", x_min_ii, " <= x <= ", x_max_ii, ") = ", format(px_ii, nsmall = 3), "\n"))
## P(10 <= x <= 14) = 0.4772
# #Grear Tire N(mu = 36500, sd = 5000)
# #Probability that the tire mileage, x, will exceed 40,000 # 24.2% Tires
mu_ii <- 36500
sd_ii <- 5000
x_ii <- 40000
#
z_ii <- (x_ii - mu_ii)/sd_ii
#
#pnorm(q = 40000, mean = 36500, sd = 5000, lower.tail = FALSE)
pz_ii <- round(pnorm(q = z_ii, lower.tail = FALSE), 4)
px_ii <- round(pnorm(q = x_ii, mean = mu_ii, sd = sd_ii, lower.tail = FALSE), 4)
stopifnot(identical(px_ii, pz_ii))
#
cat(paste0("P(x >= ", x_ii, ") = ", format(px_ii, nsmall = 4), " (",
round(100* px_ii, 2), "%)\n"))
## P(x >= 40000) = 0.2420 (24.2%)
#
# #What should the guarantee mileage be if no more than 10% of the tires are to be
# #eligible for the discount guarantee i.e. left <10% # ~30100 miles
p_l_ii <- 0.10
p_r_ii <- 1 - p_l_ii
#
#qnorm(p = 0.10, mean = 36500, sd = 5000)
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
xz_ii <- z_ii * sd_ii + mu_ii
#
x_ii <- round(qnorm(p = p_l_ii, mean = mu_ii, sd = sd_ii, lower.tail = TRUE), 4)
stopifnot(abs(xz_ii - x_ii) < 1)
cat(paste0("(Left) P(x) = ", p_l_ii, " (i.e. (Right) 1-P(z) = ", p_r_ii,
") at x = ", round(x_ii, 1), "\n"))
## (Left) P(x) = 0.1 (i.e. (Right) 1-P(z) = 0.9) at x = 30092.2
The sample contains only a portion of the population. Some sampling error is to be expected. So, the sample results provide only estimates of the values of the corresponding population characteristics.
Random sample vs. SRS
Suppose, from a Population, we take a sample of size \({n}\) and calculate the point estimate mean \(\overline{x}_{1}\). Further, we can select another random sample from the Population and get another point estimate mean \(\overline{x}_{2}\). If we repeat this process 500 times, we will have a frame of \(\{\overline{x}_{1}, \overline{x}_{2}, \ldots, \overline{x}_{500}\}\).
If we consider the process of selecting a simple random sample as an experiment, the sample mean \({\overline{x}}\) is the numerical description of the outcome of the experiment. Thus, the sample mean \({\overline{x}}\) is a random variable. As a result, just like other random variables, \({\overline{x}}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \({\overline{x}}\) are the result of different simple random samples, the probability distribution of \({\overline{x}}\) is called the sampling distribution of \({\overline{x}}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \({\overline{x}}\) is to the population mean \({\mu}\).
Just as with other probability distributions, the sampling distribution of \({\overline{x}}\) has an expected value or mean, a standard deviation, and a characteristic shape or form.
\[\begin{align} \text{Finite Population:} \sigma_{\overline{x}} &= \sqrt{\frac{N - n}{N-1}}\left(\frac{\sigma}{\sqrt{n}} \right) \\ \text{Infinite Population:} \sigma_{\overline{x}} &= \frac{\sigma}{\sqrt{n}} \end{align} \tag{11.1}\]
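Equation (11.1) can be checked numerically. The values below (\(\sigma = 10\), \(n = 50\), \(N = 1000\)) are illustrative assumptions, not from any example in the text:

```r
# #Standard Error of the Mean per equation (11.1) (illustrative values)
sigma_ii <- 10 #population standard deviation
n_ii <- 50 #sample size
N_ii <- 1000 #finite population size
#
se_inf_ii <- sigma_ii / sqrt(n_ii) #infinite population
se_fin_ii <- sqrt((N_ii - n_ii) / (N_ii - 1)) * se_inf_ii #finite population correction
round(c(Infinite = se_inf_ii, Finite = se_fin_ii), 4)
## Infinite   Finite
##   1.4142   1.3791
```

Note that the finite population correction factor is close to 1 when \(n/N\) is small, which is why the infinite-population formula is the one normally used.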
Refer Effect of Sample Size and Repeat Sampling
Figure 11.1 Effect of Sample Size vs Repeat Sampling
“ForLater”
If a statistically independent sample of \({n}\) observations \({x_1,x_2,\ldots,x_n}\) is taken from a statistical population with a standard deviation of \(\sigma\), then the mean value calculated from the sample \(\overline{x}\) will have an associated standard error of the mean \(\sigma_\overline{x}\) given by
\[\sigma_\overline{x} = \frac{\sigma}{\sqrt{n}} \tag{11.2}\]
The standard deviation \(\sigma\) of the population being sampled is seldom known. Therefore, \(\sigma_\overline{x}\) is usually estimated by replacing \(\sigma\) with the sample standard deviation \(\sigma_{x}\) instead:
\[\sigma_\overline{x} \approx \frac{\sigma_{x}}{\sqrt{n}} \tag{11.3}\]
As this is only an ‘estimator’ for the true “standard error,” other notations are used, such as:
\[\widehat{\sigma}_\overline{x} = \frac{\sigma_{x}}{\sqrt{n}} \tag{11.4}\]
OR:
\[{s}_\overline{x} = \frac{s}{\sqrt{n}} \tag{11.5}\]
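A minimal sketch of equation (11.5), estimating the standard error from a single simulated sample (the distribution parameters below are assumed for illustration):

```r
# #Estimated Standard Error of the Mean s/sqrt(n), per equation (11.5)
set.seed(111) #reproducible simulated sample (illustrative values)
bb_ii <- rnorm(n = 400, mean = 36500, sd = 5000)
se_ii <- sd(bb_ii) / sqrt(length(bb_ii))
cat(paste0("s = ", round(sd(bb_ii), 1), "; s/sqrt(n) = ", round(se_ii, 1), "\n"))
```

With \(n = 400\) and a population sd near 5000, the estimated standard error comes out near \(5000/\sqrt{400} = 250\).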
Key:
Non-mathematical view:
Form of the Sampling Distribution of \({\overline{x}}\)
When the population has a normal distribution, the sampling distribution of \({\overline{x}}\) is normally distributed for any sample size.
When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \({\overline{x}}\).
How large does the sample size need to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal? General statistical practice is to treat \(n \geq 30\) as sufficient for most populations.
Task of developing a profile of 2500 managers. The characteristics to be identified include the mean annual salary for the managers and the proportion of managers having completed a training.
Caution: Here, we took advantage of the fact that the population mean \({\mu}\) and the population standard deviation \({\sigma}\) were known. However, usually these values will be unknown.
Properties of Point Estimators
Three properties of good point estimators: unbiasedness, efficiency, and consistency.
\(\theta = \text{the population parameter of interest}\) \(\hat{\theta} = \text{the sample statistic or point estimator of } \theta\)
Other Sampling Methods
In order to develop an interval estimate of a population mean, either the population standard deviation \({\sigma}\) or the sample standard deviation \({s}\) must be used to compute the margin of error. In most applications \({\sigma}\) is not known, and \({s}\) is used to compute the margin of error.
In some applications, large amounts of relevant historical data are available and can be used to estimate the population standard deviation prior to sampling. Also, in quality control applications where a process is assumed to be operating correctly, or ‘in control,’ it is appropriate to treat the population standard deviation as known.
Sampling distribution of \({\overline{x}}\) can be used to compute the probability that \({\overline{x}}\) will be within a given distance of \({\mu}\).
Example: Lloyd Department Store
Interval Estimate of a Population Mean: \({\sigma}\) known is given by equation (12.1)
\[\begin{align} \overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \end{align} \tag{12.1}\]
where \((1 − \alpha)\) is the confidence coefficient and \(z_{\alpha/2}\) is the z-value providing an area of \(\alpha/2\) in the upper tail of the standard normal probability distribution.
For a 95% confidence interval, the confidence coefficient is \((1 − \alpha) = 0.95\) and thus, \(\alpha = 0.05\). Using the standard normal probability table, an area of \(\alpha/2 = 0.05/2 = 0.025\) in the upper tail provides \(z_{.025} = 1.96\).
# #Find z-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
#
# #Critical Value (z) for Common Significance level Alpha (α) or Confidence level (1-α)
xxalpha <- c("10%" = 0.1, "5%" = 0.05, "5/2%" = 0.025, "1%" = 0.01, "1/2%" = 0.005)
#
# #Left Tail Test
round(qnorm(p = xxalpha, lower.tail = TRUE), 4)
## 10% 5% 5/2% 1% 1/2%
## -1.2816 -1.6449 -1.9600 -2.3263 -2.5758
#
# #Right Tail Test
round(qnorm(p = xxalpha, lower.tail = FALSE), 4)
## 10% 5% 5/2% 1% 1/2%
## 1.2816 1.6449 1.9600 2.3263 2.5758
The t distribution is a family of similar probability distributions, with a specific t distribution depending on a parameter known as the degrees of freedom. As the number of degrees of freedom increases, the difference between the t distribution and the standard normal distribution becomes smaller and smaller.
Just as \(z_{0.025}\) was used to indicate the z value providing a 0.025 area in the upper tail of a standard normal distribution, \(t_{0.025}\) indicates a 0.025 area in the upper tail of a t distribution. In general, the notation \(t_{\alpha/2}\) represents a t value with an area of \(\alpha/2\) in the upper tail of the t distribution.
As the degrees of freedom increase, the t distribution approaches the standard normal distribution. Ex: \(t_{0.025} = 2.262 \,(\text{DOF} = 9)\), \(t_{0.025} = 2.000 \,(\text{DOF} = 60)\), and \(t_{0.025} = 1.96 \,(\text{DOF} = \infty) = z_{0.025}\)
Interval Estimate of a Population Mean: \({\sigma}\) Unknown is given by equation (12.2)
\[\begin{align} \overline{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \end{align} \tag{12.2}\]
where \({s}\) is the sample standard deviation, \((1 − \alpha)\) is the confidence coefficient and \(t_{\alpha/2}\) is the t-value providing an area of \(\alpha/2\) in the upper tail of the t distribution with \({n-1}\) degrees of freedom.
Refer equation (7.12), the expression for the sample standard deviation is
\[{s} = \sqrt{\frac{\sum \left(x_i - \bar{x}\right)^2}{n-1}}\]
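A quick numeric check of this expression against R's built-in `sd()` (made-up data):

```r
# #Sample Standard Deviation: sum of squared deviations over (n-1) (made-up data)
x_ii <- c(46, 54, 42, 46, 32)
s_ii <- sqrt(sum((x_ii - mean(x_ii))^2) / (length(x_ii) - 1))
stopifnot(all.equal(s_ii, sd(x_ii))) #sd() uses the same (n-1) denominator
s_ii
## [1] 8
```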
Why the degrees of freedom are \((n-1)\)
Larger sample sizes are needed if the distribution of the population is highly skewed or includes outliers.
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
# #Find t-value for confidence interval 95% i.e. (1-alpha) = 0.95 i.e. alpha = 0.05
# #To look for Area under the curve towards Right only i.e. alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
# #t-tables are unique for different degrees of freedom i.e. for DOF = 9
dof_ii <- 9
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.2622 (dof = 9)
#
dof_ii <- 60
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 2.0003 (dof = 60)
#
dof_ii <- 600
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9639 (dof = 600)
#
# #t-tables have an Infinity row, which is the same as the z-table. For DOF >100, it can be used.
dof_ii <- Inf
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.96 (dof = Inf)
#
z_ii <- round(qnorm(p = p_l_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(z) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(z) = ",
format(p_r_ii, nsmall = 3), ") at z = ", z_ii, "\n"))
## (Left) P(z) = 0.975 (i.e. (Right) 1-P(z) = 0.025) at z = 1.96
# #A sample of n = 70 households provided the credit card balances.
xxCreditCards <- c(9430, 7535, 4078, 5604, 5179, 4416, 10676, 1627, 10112, 6567, 13627, 18719, 14661, 12195, 10544, 13659, 7061, 6245, 13021, 9719, 2200, 10746, 12744, 5742, 7159, 8137, 9467, 12595, 7917, 11346, 12806, 4972, 11356, 7117, 9465, 19263, 9071, 3603, 16804, 13479, 14044, 6817, 6845, 10493, 615, 13627, 12557, 6232, 9691, 11448, 8279, 5649, 11298, 4353, 3467, 6191, 12851, 5337, 8372, 7445, 11032, 6525, 5239, 6195, 12584, 15415, 15917, 12591, 9743, 10324)
f_setRDS(xxCreditCards)
bb <- f_getRDS(xxCreditCards)
mean_bb <- mean(bb)
sd_bb <- sd(bb)
dof_bb <- length(bb) - 1L
# #t-value for confidence interval 95% | (1-alpha) = 0.95 | alpha = 0.05 | alpha/2 = 0.025
p_r_ii <- 0.025
p_l_ii <- 1 - p_r_ii
#
dof_ii <- dof_bb
t_ii <- round(qt(p = p_l_ii, df = dof_ii, lower.tail = TRUE), 4)
cat(paste0("(Left) P(t) = ", format(p_l_ii, nsmall = 3), " (i.e. (Right) 1-P(t) = ",
format(p_r_ii, nsmall = 3), ") at t = ", t_ii, " (dof = ", dof_ii, ")\n"))
## (Left) P(t) = 0.975 (i.e. (Right) 1-P(t) = 0.025) at t = 1.9949 (dof = 69)
#
# #Interval Estimate
err_margin_bb <- t_ii * sd_bb / sqrt(length(bb))
est_l <- mean_bb - err_margin_bb
est_r <- mean_bb + err_margin_bb
#
cat(paste0("Normal Sample (n=", length(bb), ", mean=", mean_bb, ", sd=", round(sd_bb, 1),
"):\n Point Estimate = ", mean_bb, ", Margin of error = ", round(err_margin_bb, 1),
", ", (1-2*p_r_ii) * 100, "% confidence interval is [",
round(est_l, 1), ", ", round(est_r, 1), "]"))
## Normal Sample (n=70, mean=9312, sd=4007):
## Point Estimate = 9312, Margin of error = 955.4, 95% confidence interval is [8356.6, 10267.4]
Note:
Note:
All hypothesis testing applications involve collecting a sample and using the sample results to provide evidence for drawing a conclusion.
In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis.
For hypothesis tests involving a population mean, we let \(\mu_0\) denote the hypothesized value and we must choose one of the following three forms for the hypothesis test.
Alternative is One-Sided, if it states that a parameter is larger or smaller than the null value. Alternative is Two-sided, if it states that the parameter is different from the null value.
Ideally the hypothesis testing procedure should lead to the acceptance of \({H_0}\) when \({H_0}\) is true and the rejection of \({H_0}\) when \({H_a}\) is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors.
Figure 13.1 Type-I \((\alpha)\) and Type-II \((\beta)\) Errors
12.3 The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting \({\alpha}\), that person is controlling the probability of making a Type I error.
Although most applications of hypothesis testing control for the probability of making a Type I error, they do not always control for the probability of making a Type II error. Hence, if we decide to accept \({H_0}\), we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement "do not reject \({H_0}\)" instead of “accept \({H_0}\).” Using the statement “do not reject \({H_0}\)” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting \({H_0}\), the statistician avoids the risk of making a Type II error.
Refer figure 13.1
13.22 The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
The test statistic summarizes the observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in the statistical model. Refer Table 13.1
| Test statistic | \({H_0}\) and \({H_a}\) | Statistical tests that use it |
|---|---|---|
| t-value | Null: The means of two groups are equal. Alternative: The means of two groups are not equal | T-test, Regression tests |
| z-value | Null: The means of two groups are equal. Alternative: The means of two groups are not equal | Z-test |
| F-value | Null: The variation among two or more groups is greater than or equal to the variation between the groups. Alternative: The variation among two or more groups is smaller than the variation between the groups | ANOVA, ANCOVA, MANOVA |
| \({\chi}^2\text{-value}\) | Null: Two samples are independent. Alternative: Two samples are not independent (i.e. they are correlated) | Chi-squared test, Non-parametric correlation tests |
7.14 A tail refers to the tapering sides at either end of a distribution curve.
One tailed-tests are concerned with one side of a statistic. Thus, one-tailed tests deal with only one tail of the distribution, and the z-score is on only one side of the statistic. Whereas, Two-tailed tests deal with both tails of the distribution, and the z-score is on both sides of the statistic.
In a one-tailed test, the area under the rejection region is equal to the level of significance, \({\alpha}\). When the rejection region is below the acceptance region, we say that it is a left-tail test. Similarly, when the rejection region is above the acceptance region, we say that it is a right-tail test.
In the two-tailed test, there are two critical regions, and the area under each region is \(\frac{\alpha}{2}\).
One-Tail vs. Two-Tail
One-tailed tests about a population mean take one of the following two forms:
13.4 \(\text{\{Left Tail or Lower Tail\} } H_0 : \mu \geq \mu_0 \iff H_a: \mu < \mu_0\)
13.5 \(\text{\{Right Tail or Upper Tail\} } H_0 : \mu \leq \mu_0 \iff H_a: \mu > \mu_0\)
Example: The label on a can of Hilltop Coffee states that the can contains 3 pounds of coffee. As long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the government (FTC) interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can.
\[z = \frac{\overline{x} - \mu_0}{\sigma_{\overline{x}}} = \frac{\overline{x} - \mu_0}{\sigma/\sqrt{n}} \tag{13.1}\]
The key question for a lower tail test is: How small must the test statistic \({z}\) be before we choose to reject the null hypothesis?
Two approaches can be used to answer this: the p-value approach and the critical value approach.
p-value (p) is the probability of obtaining a result equal to or more extreme than was observed in the data. It is the probability of observing the result given that the null hypothesis is true. A small p-value indicates the value of the test statistic is unusual given the assumption that \({H_0}\) is true.
For a lower tail test, the p-value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample. - we use the standard normal distribution to find the probability that \({z}\) is less than or equal to the value of the test statistic. - After computing the p-value, we must then decide whether it is small enough to reject the null hypothesis; this decision involves comparing the p-value to the level of significance.
For the Hilltop Coffee Example
Rejection Rule: Reject \({H_0}\) if p-value \(\leq {\alpha}\)
Further, in this case, we would reject \({H_0}\) for any value of \({\alpha} \geq (p = 0.0038)\). For this reason, the p-value is also called the observed level of significance.
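The p-value of 0.0038 quoted above can be reproduced with `pnorm()`. The sample figures used below (n = 36, mean = 2.92, sigma = 0.18) are assumed from the textbook example; they are not stated in this section:

```r
# #Hilltop Coffee Lower Tail Test: H0: mu >= 3 vs Ha: mu < 3
# #Sample values (n, xbar, sigma) assumed from the textbook example
mu0_ii <- 3
sigma_ii <- 0.18
n_ii <- 36
xbar_ii <- 2.92
#
z_ii <- (xbar_ii - mu0_ii) / (sigma_ii / sqrt(n_ii)) #test statistic per equation (13.1)
p_ii <- round(pnorm(q = z_ii, lower.tail = TRUE), 4) #lower tail area
cat(paste0("z = ", round(z_ii, 2), ", p-value = ", p_ii, "\n"))
## z = -2.67, p-value = 0.0038
```

Because the p-value 0.0038 is less than \(\alpha = 0.01\), \({H_0}\) is rejected.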
For a lower tail test, the critical value serves as a benchmark for determining whether the value of the test statistic is small enough to reject the null hypothesis. - Critical value is the value of the test statistic that corresponds to an area of \({\alpha}\) (the level of significance) in the lower tail of the sampling distribution of the test statistic. - In other words, the critical value is the largest value of the test statistic that will result in the rejection of the null hypothesis.
Hilltop Coffee Example
Rejection Rule: Reject \({H_0}\) if \(z \leq z_{\alpha}\)
The p-value approach to hypothesis testing and the critical value approach will always lead to the same rejection decision; that is, whenever the p-value is less than or equal to \({\alpha}\), the value of the test statistic will be less than or equal to the critical value.
For an upper tail test, the test statistic \({z}\) is still computed as earlier. But the p-value is now the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. Thus, to compute the p-value for the upper tail test in the \({\sigma}\) known case, we must use the standard normal distribution to find the probability that \({z}\) is greater than or equal to the value of the test statistic. Using the critical value approach causes us to reject the null hypothesis if the value of the test statistic is greater than or equal to the critical value \(z_{\alpha}\); in other words, we reject \({H_0}\) if \(z \geq z_{\alpha}\).
12.1 Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{Margin of Error}\)
\[Z=\frac {\overline {X}-\mu }{\sigma/{\sqrt{n}}} \quad \Leftrightarrow \mu = \overline {X} - Z \frac{\sigma}{\sqrt{n}} \quad \Rightarrow \mu = \overline {X} \pm Z \frac{\sigma}{\sqrt{n}} \quad \Rightarrow \mu \approx \overline {X} \pm Z \frac{s}{\sqrt{n}} \tag{13.2}\]
13.6 \(\text{\{Two Tail\} } H_0 : \mu = \mu_0 \iff H_a: \mu \neq \mu_0\)
Ex: Golf Company, mean driving distance is 295 yards i.e. \((\mu_0 = 295)\)
\(H_0 : \mu = 295 \iff H_a: \mu \neq 295\)
The quality control team selected \(\alpha = 0.05\) as the level of significance for the test.
From previous tests, assume known \(\sigma = 12\)
For a sample size \(n = 50\)
Suppose for the sample, \(\overline{x} = 297.6\)
p-value approach
critical value approach
Rejection Rule: Reject \({H_0}\) if \(z \leq -z_{\alpha/2}\) or \(z \geq z_{\alpha/2}\)
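Running the Golf Company numbers above through both approaches (values from the example: mu0 = 295, sigma = 12, n = 50, xbar = 297.6, alpha = 0.05):

```r
# #Golf Company Two Tail Test: H0: mu = 295 vs Ha: mu != 295
mu0_ii <- 295
sigma_ii <- 12
n_ii <- 50
xbar_ii <- 297.6
alpha_ii <- 0.05
#
z_ii <- (xbar_ii - mu0_ii) / (sigma_ii / sqrt(n_ii)) #test statistic per equation (13.1)
# #p-value approach: double the upper tail area beyond |z|
p_ii <- round(2 * pnorm(q = abs(z_ii), lower.tail = FALSE), 4)
# #Critical value approach: reject H0 if |z| >= z_(alpha/2)
z_crit_ii <- round(qnorm(p = alpha_ii / 2, lower.tail = FALSE), 4)
cat(paste0("z = ", round(z_ii, 2), ", p-value = ", p_ii,
           ", critical z = ", z_crit_ii, " => do NOT reject H0\n"))
## z = 1.53, p-value = 0.1255, critical z = 1.96 => do NOT reject H0
```

Since the p-value 0.1255 exceeds \(\alpha = 0.05\) (equivalently, \(z = 1.53 < 1.96\)), \({H_0}\) is not rejected.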
(Online, might be wrong) Ex: Assume that for a Population with mean \({\mu}\) unknown and standard deviation \({\sigma} = 15\), if we take a sample \({n = 100}\) its sample mean is \({\overline{x}} = 42\).
Assume \({\alpha} = 0.05\) and if we are conducting a Two Tail Test, \(Z_{\alpha/2=0.05/2} = 1.960\)
As shown in the equation (13.2), our interval range is \(\mu = \overline {X} \pm 2.94 = 42 \pm 2.94 \rightarrow \mu \in (39.06, 44.94)\)
We are 95% confident that the population mean will be between 39.06 and 44.94
Note that a 95% confidence interval does not mean there is a 95% chance that the true value being estimated is in the calculated interval. Rather, given a population, there is a 95% chance that choosing a random sample from this population results in a confidence interval which contains the true value being estimated.
Common Steps
p-Value Approach Step
Critical Value Approach
Refer equation (12.1): for the \({\sigma}\) known case, the \(100{(1 - \alpha)}\%\) confidence interval estimate of a population mean is given by
\[\overline{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\]
We know that \(100 {(1 - \alpha)}\%\) of the confidence intervals generated will contain the population mean and \(100 {\alpha}\%\) of the confidence intervals generated will not contain the population mean.
Thus, if we reject \({H_0}\) whenever the confidence interval does not contain \({\mu}_0\), we will be rejecting the null hypothesis when it is true \((\mu = {\mu}_0)\) with probability \({\alpha}\).
The level of significance is the probability of rejecting the null hypothesis when it is true. So constructing a \(100 {(1 - \alpha)}\%\) confidence interval and rejecting \({H_0}\) whenever the interval does not contain \({\mu}_0\) is equivalent to conducting a two-tailed hypothesis test with \({\alpha}\) as the level of significance.
Ex: Golf company
“ForLater” - Exercises
For the \({\sigma}\) unknown case, the sampling distribution of the test statistic follows the t distribution with \((n − 1)\) degrees of freedom. Refer equation (13.3)
\[t = \frac{\overline{x} - \mu_0}{s/\sqrt{n}} \tag{13.3}\]
One-Tailed Test
Critical Value Approach - \((\text{DOF = 59}), \,t_{{\alpha} = 0.05} = 1.671\) - Because \((t = 1.84) > (t_{{\alpha} = 0.05} = 1.671)\), Reject \({H_0}\)
# #Like pnorm() is for P(z) and qnorm() is for z, pt() is for P(t) and qt() is for t.
#
# #p-value approach: Find Cumulative Probability P corresponding to the given t-value & DOF=59
pt(q = 1.84, df = 59, lower.tail = FALSE)
## [1] 0.03539999
#
# #Critical Value: t-value for which Area under the curve towards Right is alpha=0.05 & DOF=59
qt(p = 0.05, df = 59, lower.tail = FALSE)
## [1] 1.671093
Two Tailed Test
Critical Value Approach - \((\text{DOF = 24})\) - We find that \(P_{\left(t\right)} = 0.025\) for \(-t_{\alpha/2 = 0.025} = -2.064\) and \(t_{\alpha/2 = 0.025} = 2.064\) - Compare the test statistic with the critical t-value - Because \((t = -1.10)\) is NOT lower than \((-t_{\alpha/2 = 0.025} = -2.064)\), we can NOT reject \({H_0}\)
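The two-tailed decision above can be verified with `qt()` and `pt()` (values from the example: t = -1.10, DOF = 24, alpha = 0.05):

```r
# #Two Tailed t Test: t = -1.10, DOF = 24, alpha = 0.05
t_ii <- -1.10
dof_ii <- 24
alpha_ii <- 0.05
#
# #Critical value: +/- t for alpha/2 = 0.025 in each tail
t_crit_ii <- round(qt(p = alpha_ii / 2, df = dof_ii, lower.tail = FALSE), 4) #2.0639
# #p-value approach: double the tail area beyond |t|
p_ii <- round(2 * pt(q = abs(t_ii), df = dof_ii, lower.tail = FALSE), 4)
cat(paste0("p-value = ", p_ii, ", critical t = +/-", t_crit_ii,
           " => do NOT reject H0\n"))
```

Because \(|t| = 1.10 < 2.0639\) (equivalently, the p-value exceeds \(\alpha = 0.05\)), \({H_0}\) is not rejected.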
If the purpose of a hypothesis test is to make a decision when \({H_0}\) is true and a different decision when \({H_a}\) is true, the decision maker may want to, and in some cases be forced to, take action with both the conclusion do not reject \({H_0}\) and the conclusion reject \({H_0}\).
If this situation occurs, statisticians generally recommend controlling the probability of making a Type II error. With the probabilities of both the Type I and Type II error controlled, the conclusion from the hypothesis test is either to accept \({H_0}\) or reject \({H_0}\). In the first case, \({H_0}\) is concluded to be true, while in the second case, \({H_a}\) is concluded true. Thus, a decision and appropriate action can be taken when either conclusion is reached.
“ForLater” - Calculate \({\beta}\)
When the true population mean \({\mu}\) is close to the null hypothesis value of \({\mu} = 120\), the probability is high that we will make a Type II error. However, when the true population mean \({\mu}\) is far below the null hypothesis value of \({\mu} = 120\), the probability is low that we will make a Type II error.
Note that the power curve extends over the values of \({\mu}\) for which the null hypothesis is false. The height of the power curve at any value of \({\mu}\) indicates the probability of correctly rejecting \({H_0}\) when \({H_0}\) is false.
We can make 3 observations about the relationship among \({\alpha}, \beta, n (\text{sample size})\).
How interval estimates and hypothesis tests can be developed for situations involving two populations when the difference between the two population means or the two population proportions is of prime importance.
Example
To develop an interval estimate of the difference between the mean starting salary for a population of men and the mean starting salary for a population of women.
To conduct a hypothesis test to determine whether any difference is present between the proportion of defective parts in a population of parts produced by supplier A and the proportion of defective parts in a population of parts produced by supplier B.
The matched sample design is generally preferred to the independent sample design because the matched-sample procedure often improves the precision of the estimate.
In many manufacturing applications, controlling the process variance is extremely important in maintaining quality.
The sample variance \({s^2}\), given by equation (7.11), is the point estimator of the population variance \({\sigma}^2\).
\[s^2 = \frac{\sum {(x_i - {\overline{x}})}^2}{n-1} \tag{7.11}\]
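Equation (7.11) can be checked directly in R; the data below are hypothetical, and note that `var()` already uses the \(n-1\) denominator.

```r
# Hypothetical sample of n = 5 observations (illustrative only)
x <- c(46, 54, 42, 46, 32)
n <- length(x)

# Equation (7.11): sum of squared deviations divided by (n - 1)
s2_manual <- sum((x - mean(x))^2) / (n - 1)

# Built-in var() uses the same (n - 1) denominator
s2_builtin <- var(x)

c(manual = s2_manual, builtin = s2_builtin)
```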
Note:
Interval Estimation
Example: A sample of 20 containers \((n = 20)\) has the sample variance \({s^2} = 0.0025\)
# #pnorm() qnorm() | pt() qt() | pchisq() qchisq()
#
# #p-value approach: Find Cumulative Probability P corresponding to the given ChiSq & DOF=19
pchisq(q = 32.852, df = 19, lower.tail = FALSE)
## [1] 0.02500216
#
# #ChiSq value for which Area under the curve towards Right is alpha=0.025 & DOF=19 #32.852
qchisq(p = 0.025, df = 19, lower.tail = FALSE)
## [1] 32.85233
\[\frac{(n-1)s^2}{{\chi_{0.025}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{0.975}^2}} \tag{15.1}\]
Generalising the equation (15.1), the equation (15.2) is the interval estimate of a population variance.
\[\frac{(n-1)s^2}{{\chi_{{\alpha}/2}^2}} \leq {\sigma}^2 \leq \frac{(n-1)s^2}{{\chi_{{1-\alpha}/2}^2}} \tag{15.2}\]
where the \({\chi^2}\) values are based on a chi-square distribution with \({n-1}\) degrees of freedom and where \((1 − {\alpha})\) is the confidence coefficient.
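Applying equation (15.2) to the container example above (\(n = 20\), \(s^2 = 0.0025\), 95% confidence), the interval can be computed with `qchisq()`:

```r
# Container-filling example from the text: n = 20, s^2 = 0.0025, 95% CI
n     <- 20
s2    <- 0.0025
alpha <- 0.05

# chi-square critical values with n - 1 = 19 degrees of freedom
chi_upper <- qchisq(alpha / 2, df = n - 1, lower.tail = FALSE)      # ~32.852
chi_lower <- qchisq(1 - alpha / 2, df = n - 1, lower.tail = FALSE)  # ~8.907

# Equation (15.2): interval estimate for sigma^2
ci <- c(lower = (n - 1) * s2 / chi_upper,
        upper = (n - 1) * s2 / chi_lower)
ci
```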
“ForLater”
Hypothesis-testing procedures that expand our capacity for making statistical inferences about populations
All tests apply to categorical variables and all tests use a chi-square \({\chi^2}\) test statistic that is based on the differences between observed frequencies and expected frequencies. In each case, expected frequencies are computed under the assumption that the null hypothesis is true. These chi-square tests are upper tailed tests. Large differences between observed and expected frequencies provide a large value for the chi-square test statistic and indicate that the null hypothesis should be rejected.
The test for the equality of population proportions for three or more populations is based on independent random samples selected from each of the populations. The sample data show the counts for each of two categorical responses for each population. The null hypothesis is that the population proportions are equal. Rejection of the null hypothesis supports the conclusion that the population proportions are not all equal.
The test of independence between two categorical variables uses one sample from a population with the data showing the counts for each combination of two categorical variables. The null hypothesis is that the two variables are independent and the test is referred to as a test of independence. If the null hypothesis is rejected, there is statistical evidence of an association or dependency between the two variables.
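A minimal sketch of the test of independence with `chisq.test()`, using hypothetical counts (the variable names and data are illustrative, not from the text):

```r
# Illustrative (hypothetical) counts: product preference by gender
tbl <- matrix(c(20, 30, 40,
                30, 20, 10), nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Brand  = c("A", "B", "C")))

# chisq.test() computes expected frequencies under H0 (independence)
res <- chisq.test(tbl)
res$statistic  # chi-square test statistic
res$p.value    # upper-tail p-value; reject H0 if below alpha
```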
The goodness of fit test is used to test the hypothesis that a population has a specific historical or theoretical probability distribution. We showed applications for populations with a multinomial probability distribution and with a normal probability distribution. Since the normal probability distribution applies to continuous data, intervals of data values were established to create the categories for the categorical variable required for the goodness of fit test.
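For a multinomial goodness of fit test, `chisq.test()` accepts hypothesized proportions via its `p` argument; the shares below are hypothetical:

```r
# Hypothetical market shares under H0: A 30%, B 50%, C 20%
observed <- c(A = 48, B = 98, C = 54)   # counts from a sample of 200
p0 <- c(0.30, 0.50, 0.20)

gof <- chisq.test(observed, p = p0)
gof$expected   # expected frequencies under H0: 60, 100, 40
gof$statistic
gof$p.value
```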
Analysis of variance (ANOVA) can be used to test for differences among means of several populations or treatments.
The completely randomized design and the randomized block design are used to draw conclusions about differences in the means of a single factor. The primary purpose of blocking in the randomized block design is to remove extraneous sources of variation from the error term. Such blocking provides a better estimate of the true error variance and a better test to determine whether the population or treatment means of the factor differ significantly.
The basis for the statistical tests used in analysis of variance and experimental design is the development of two independent estimates of the population variance \({\sigma}^2\). In the single-factor case, one estimator is based on the variation between the treatments; this estimator provides an unbiased estimate of \({\sigma}^2\) only if the means \(\{{\mu}_1, {\mu}_2, \ldots, {\mu}_k\}\) are all equal. A second estimator of \({\sigma}^2\) is based on the variation of the observations within each sample; this estimator will always provide an unbiased estimate of \({\sigma}^2\).
By computing the ratio of these two estimators (the F statistic), it is determined whether to reject the null hypothesis that the population or treatment means are equal.
In all the experimental designs considered, the partitioning of the sum of squares and degrees of freedom into their various sources enabled us to compute the appropriate values for the analysis of variance calculations and tests.
Further, Fisher's LSD procedure and the Bonferroni adjustment can be used to perform pairwise comparisons to determine which means are different.
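A single-factor (completely randomized) design can be analysed with `aov()`; the data below are hypothetical:

```r
# Hypothetical single-factor data: response times for three methods
d <- data.frame(
  method = factor(rep(c("A", "B", "C"), each = 5)),
  time   = c(62, 60, 63, 59, 61,  63, 67, 71, 64, 65,  68, 66, 71, 67, 68)
)

# aov() partitions the total sum of squares into treatment and error
fit <- aov(time ~ method, data = d)
summary(fit)   # F statistic = MSTR / MSE with (k - 1, n_T - k) df
```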
\[{y} = {\beta}_0 + {\beta}_1 {x} + {\epsilon} \tag{18.1}\]
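The parameters \(\beta_0\) and \(\beta_1\) in equation (18.1) are estimated by least squares with `lm()`; the paired data below are hypothetical:

```r
# Hypothetical paired data: x = ads run, y = units sold
ads   <- c(1, 3, 2, 1, 3)
sales <- c(14, 24, 18, 17, 27)

# lm() fits y = b0 + b1*x by least squares
fit <- lm(sales ~ ads)
coef(fit)   # b0 (intercept) and b1 (slope) estimate beta0, beta1
```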
Note
Multiple regression analysis enables us to understand how a dependent variable is related to two or more independent variables.
Parametric methods mostly require quantitative data. However, they are generally more powerful than nonparametric methods.
Most of the statistical methods referred to as parametric methods require quantitative data, while nonparametric methods allow inferences based on either categorical or quantitative data.
1.1: Vectors
Vectors are the simplest type of data structure in R. A vector is a sequence of data elements of the same basic type.
1.2: Components
Members of a vector are called components.
1.3: Packages
Packages are the fundamental units of reproducible R code.
2.1: R-Markdown
R Markdown is a file format for making dynamic documents with R.
2.2: NA
NA is a logical constant of length 1 which contains a missing value indicator.
2.3: Factors
Factors are the data objects which are used to categorize the data and store it as levels.
2.4: Lists
Lists are by far the most flexible data structure in R. They can be seen as a collection of elements without any restriction on the class, length or structure of each element.
2.5: DataFrame
Data Frames are lists with restriction that all elements of a data frame are of equal length.
5.1: Data
Data are the facts and figures collected, analysed, and summarised for presentation and interpretation.
5.2: Elements
Elements are the entities on which data are collected. (Generally ROWS)
5.3: Variable
A variable is a characteristic of interest for the elements. (Generally COLUMNS)
5.4: Observation
The set of measurements obtained for a particular element is called an observation.
5.5: Statistics
Statistics is the art and science of collecting, analysing, presenting, and interpreting data.
5.6: Scale-of-Measurement
The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.
5.7: Nominal-Scale
When the data for a variable consist of labels or names used to identify an attribute of the element, the scale of measurement is considered a nominal scale.
5.8: Ordinal-Scale
The scale of measurement for a variable is considered an ordinal scale if the data exhibit the properties of nominal data and in addition, the order or rank of the data is meaningful.
5.9: Interval-Scale
The scale of measurement for a variable is an interval scale if the data have all the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.
5.10: Ratio-Scale
The scale of measurement for a variable is a ratio scale if the data have all the properties of interval data and the ratio of two values is meaningful.
5.11: Categorical-Data
Data that can be grouped by specific categories are referred to as categorical data. Categorical data use either the nominal or ordinal scale of measurement.
5.12: Quantitative-Data
Data that use numeric values to indicate ‘how much’ or ‘how many’ are referred to as quantitative data. Quantitative data are obtained using either the interval or ratio scale of measurement.
5.13: Discrete
Quantitative data that measure ‘how many’ are discrete.
5.14: Continuous
Quantitative data that measure ‘how much’ are continuous because no separation occurs between the possible data values.
5.15: Cross-Sectional-Data
Cross-sectional data are data collected at the same or approximately the same point in time.
5.16: Time-Series-Data
Time-series data are data collected over several time periods.
5.17: Observational-Study
In an observational study we simply observe what is happening in a particular situation, record data on one or more variables of interest, and conduct a statistical analysis of the resulting data.
5.18: Experiment
The key difference between an observational study and an experiment is that an experiment is conducted under controlled conditions.
5.19: Descriptive-Statistics
Most of the statistical information is summarized and presented in a form that is easy to understand. Such summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.
5.20: Population
A population is the set of all elements of interest in a particular study.
5.21: Sample
A sample is a subset of the population.
5.22: Parameter-vs-Statistic
The measurable quality or characteristic is called a Population Parameter if it is computed from the population. It is called a Sample Statistic if it is computed from a sample.
5.23: Census
The process of conducting a survey to collect data for the entire population is called a census.
5.24: Sample-Survey
The process of conducting a survey to collect data for a sample is called a sample survey.
5.25: Statistical-Inference
Statistics uses data from a sample to make estimates and test hypotheses about the characteristics of a population through a process referred to as statistical inference.
5.26: Analytics
Analytics is the scientific process of transforming data into insight for making better decisions.
5.27: Descriptive-Analytics
Descriptive analytics encompasses the set of analytical techniques that describe what has happened in the past.
5.28: Predictive-Analytics
Predictive analytics consists of analytical techniques that use models constructed from past data to predict the future or to assess the impact of one variable on another.
5.29: Prescriptive-Analytics
Prescriptive analytics is the set of analytical techniques that yield a best course of action.
5.30: Big-Data
Larger and more complex data sets are now often referred to as big data.
5.31: Data-Mining
Data Mining deals with methods for developing useful decision-making information from large databases. It can be defined as the automated extraction of predictive information from (large) databases.
6.1: Frequency-Distribution
A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes.
6.2: Cross-Tab
A crosstabulation is a tabular summary of data for two variables. It is used to investigate the relationship between them. Generally, one of the variables is categorical.
7.1: Number
A number is a mathematical object used to count, measure, and label. Their study or usage is called arithmetic, a term which may also refer to number theory, the study of the properties of numbers.
7.2: Prime
A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a ‘composite number.’ 1 is neither a Prime nor a composite, it is a ‘Unit.’ Thus, by definition, Negative Integers and Zero cannot be Prime.
7.3: Parity-Odd-Even
Parity is the property of an integer \(\mathbb{Z}\) of whether it is even or odd. It is even if the integer is divisible by 2 with no remainders left and it is odd otherwise. Thus, -2, 0, +2 are even but -1, 1 are odd. Numbers ending with 0, 2, 4, 6, 8 are even. Numbers ending with 1, 3, 5, 7, 9 are odd.
7.4: Positive-Negative
An integer \(\mathbb{Z}\) is positive if it is greater than zero, and negative if it is less than zero. Zero is defined as neither negative nor positive.
7.5: Mersenne-Primes
Mersenne primes are those prime numbers that are of the form \((2^n -1)\); that is, \(\{3, 7, 31, 127, \ldots \}\)
7.6: Mean
Given a data set \({X=\{x_1,x_2,\ldots,x_n\}}\), the mean \({\overline{x}}\) is the sum of all of the values \({x_1,x_2,\ldots,x_n}\) divided by the count \({n}\).
7.7: Median
Median of a population is any value such that at most half of the population is less than the proposed median and at most half is greater than the proposed median.
7.8: Geometric-Mean
The geometric mean \(\overline{x}_g\) is a measure of location that is calculated by finding the \(n^{th}\) root of the product of \({n}\) values.
7.9: Mode
The mode is the value that occurs with greatest frequency.
7.10: Percentile
A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. For a data set containing \({n}\) observations, the \(p^{th}\) percentile divides the data into two parts: approximately p% of the observations are less than the \(p^{th}\) percentile, and approximately (100 – p)% of the observations are greater than the \(p^{th}\) percentile.
7.11: Variance
The variance \(({\sigma}^2)\) is based on the difference between the value of each observation \({x_i}\) and the mean \({\overline{x}}\). The average of the squared deviations is called the variance.
7.12: Standard-Deviation
The standard deviation (\(s, \sigma\)) is defined to be the positive square root of the variance. It is a measure of the amount of variation or dispersion of a set of values.
7.13: Skewness
Skewness \((\tilde{\mu}_{3})\) is a measure of the shape of a data distribution. It is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
7.14: Tails
A tail refers to the tapering sides at either end of a distribution curve.
7.15: Kurtosis
Kurtosis \((\tilde{\mu}_{4})\) is a measure of the “tailedness” of the probability distribution of a real-valued random variable. Like skewness, kurtosis describes the shape of a probability distribution. For \({\mathcal {N}}_{(\mu,\, \sigma)}\), kurtosis is 3 and excess kurtosis is 0 (i.e. subtract 3).
7.16: TheSample
A sample of \({n}\) observations given by \({X=\{x_1,x_2,\ldots,x_n\}}\) have a sample mean \({\overline{x}}\) and the sample standard deviation, \({s}\).
7.17: z-Scores
The z-score, \({z_i}\), can be interpreted as the number of standard deviations \({x_i}\) is from the mean \({\overline{x}}\). It is associated with each \({x_i}\). The z-score is often called the standardized value or standard score.
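A short check that the z-score formula matches R's built-in `scale()` (hypothetical data):

```r
x <- c(46, 54, 42, 46, 32)   # hypothetical sample
z <- (x - mean(x)) / sd(x)   # z-score for each x_i

# scale() centres and standardizes, giving the same values
all.equal(as.vector(scale(x)), z)
```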
7.18: t-statistic
Computing a z-score requires knowing the mean \({\mu}\) and standard deviation \({\sigma}\) of the complete population to which a data point belongs. If one only has a sample of observations from the population, then the analogous computation with sample mean \({\overline{x}}\) and sample standard deviation \({s}\) yields the t-statistic.
7.19: Chebyshev-Theorem
Chebyshev Theorem can be used to make statements about the proportion of data values that must be within a specified number of standard deviations \({\sigma}\), of the mean \({\mu}\).
7.20: Empirical-Rule
Empirical rule is used to compute the percentage of data values that must be within one, two, and three standard deviations \({\sigma}\) of the mean \({\mu}\) for a normal distribution. These percentages are approximately 68.27%, 95.45%, and 99.73%, respectively.
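The empirical-rule percentages can be recovered from the standard normal CDF with `pnorm()`:

```r
# P(mu - k*sigma < x < mu + k*sigma) for a normal distribution
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(100 * within_k, 2)   # 68.27, 95.45, 99.73
```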
7.21: Outliers
Sometimes unusually large or unusually small values are called outliers. It is a data point that differs significantly from other observations.
7.22: Covariance
Covariance is a measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
7.23: Correlation-Coefficient
Correlation coefficient is a measure of linear association between two variables that takes on values between -1 and +1. Values near +1 indicate a strong positive linear relationship; values near -1 indicate a strong negative linear relationship; and values near zero indicate the lack of a linear relationship.
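Both measures are available as `cov()` and `cor()`; the paired observations below are hypothetical:

```r
x <- c(1, 3, 2, 1, 3)
y <- c(14, 24, 18, 17, 27)   # hypothetical paired observations

cov(x, y)   # sample covariance s_xy (positive => positive relationship)
cor(x, y)   # Pearson correlation r_xy, always in [-1, +1]
```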
8.1: Probability
Probability is a numerical measure of the likelihood that an event will occur. Probability values are always assigned on a scale from 0 to 1. A probability near zero indicates an event is unlikely to occur; a probability near 1 indicates an event is almost certain to occur.
8.2: Random-Experiment
A random experiment is a process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined completely by chance.
8.3: Sample-Space
The sample space for a random experiment is the set of all experimental outcomes.
8.4: Counting-Rule
Counting Rule for Multiple-Step Experiments: If an experiment can be described as a sequence of \({k}\) steps with \({n_1}\) possible outcomes on the first step, \({n_2}\) possible outcomes on the second step, and so on, then the total number of experimental outcomes is given by \(\{(n_1)(n_2) \cdots (n_k) \}\)
8.5: Tree-Diagram
A tree diagram is a graphical representation that helps in visualizing a multiple-step experiment.
8.6: Factorial
The factorial of a non-negative integer \({n}\), denoted by \(n!\), is the product of all positive integers less than or equal to n. The value of 0! is 1 i.e. \(0!=1\)
8.7: Combinations
Combination allows one to count the number of experimental outcomes when the experiment involves selecting \({k}\) objects from a set of \({N}\) objects. The number of combinations of \({N}\) objects taken \({k}\) at a time is equal to the binomial coefficient \(C_k^N\)
8.8: Permutations
Permutation allows one to compute the number of experimental outcomes when \({k}\) objects are to be selected from a set of \({N}\) objects where the order of selection is important. The same \({k}\) objects selected in a different order are considered a different experimental outcome. The number of permutations of \({N}\) objects taken \({k}\) at a time is given by \(P_k^N\)
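Both counts are one-liners in base R; `choose()` gives \(C_k^N\), and \(P_k^N = N!/(N-k)!\) follows from `factorial()`:

```r
N <- 5; k <- 2

# Combinations: order does not matter
choose(N, k)                      # C(5, 2) = 10

# Permutations: order matters, P(N, k) = N! / (N - k)!
factorial(N) / factorial(N - k)   # P(5, 2) = 20
```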
8.9: Event
An event is a collection of sample points. The probability of any event is equal to the sum of the probabilities of the sample points in the event. The sample space, \({S}\), is an event. Because it contains all the experimental outcomes, it has a probability of 1; that is, \(P(S) = 1\)
8.10: Complement
Given an event \({A}\), the complement of A (\(A^c\)) is defined to be the event consisting of all sample points that are not in A. Thus, \(P(A) + P(A^{c}) =1\)
8.11: Union
Given two events A and B, the union of A and B is the event containing all sample points belonging to A or B or both. The union is denoted by \(A \cup B\)
8.12: Intersection
Given two events A and B, the intersection of A and B is the event containing the sample points belonging to both A and B. The intersection is denoted by \(A \cap B\)
8.13: Mutually-Exclusive
Two events are said to be mutually exclusive if the events have no sample points in common. Thus, \(A \cap B = \emptyset\), and hence \(P(A \cap B) = 0\)
8.14: Conditional-Probability
Conditional probability is the probability of an event given that another event has already occurred. The conditional probability of ‘A given B’ is \(P(A|B) = \frac{P(A \cap B)}{P(B)}\)
8.15: Events-Independent
Two events A and B are independent if \(P(A|B) = P(A) \quad \text{OR} \quad P(B|A) = P(B) \Rightarrow P(A \cap B) = P(A) \cdot P(B)\)
9.1: Random-Variable
A random variable is a numerical description of the outcome of an experiment. Random variables must assume numerical values. It can be either ‘discrete’ or ‘continuous.’
9.2: Discrete-Random-Variable
A random variable that may assume either a finite number of values or an infinite sequence of values such as \(0, 1, 2, \dots\) is referred to as a discrete random variable. Coded categorical (factor) variables, e.g. Male as 0 and Female as 1, are also discrete.
9.3: Continuous-Random-Variable
A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. It is given by \(x \in [n, m]\). If the entire line segment between the two points also represents possible values for the random variable, then the random variable is continuous.
9.4: Probability-Distribution
The probability distribution for a random variable describes how probabilities are distributed over the values of the random variable.
9.5: Probability-Function
For a discrete random variable x, a probability function \(f(x)\), provides the probability for each value of the random variable.
9.6: Expected-Value-Discrete
The expected value, or mean, of a random variable is a measure of the central location for the random variable. i.e. \(E(x) = \mu = \sum xf(x)\)
9.7: Variance-Discrete
The variance is a weighted average of the squared deviations of a random variable from its mean. The weights are the probabilities. i.e. \(\text{Var}(x) = \sigma^2 = \sum \{(x- \mu)^2 \cdot f(x)\}\)
9.8: Bivariate
A probability distribution involving two random variables is called a bivariate probability distribution. A discrete bivariate probability distribution provides a probability for each pair of values that may occur for the two random variables.
10.1: Uniform-Probability-Distribution
Uniform probability distribution is a continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length. Whenever the probability is proportional to the length of the interval, the random variable is uniformly distributed.
10.2: Probability-Density-Function
The probability that the continuous random variable \({x}\) takes a value in the interval \([a, b]\) is given by the area under the graph of the probability density function \(f(x)\); that is, \(A = \int _{a}^{b}f(x)\ dx\). Note that \(f(x)\) can be greater than 1; however, its integral over the whole range must equal 1.
10.3: Normal-Distribution
A normal distribution (\({\mathcal {N}}_{(\mu,\, \sigma^2)}\)) is a type of continuous probability distribution for a real-valued random variable.
10.4: Standard-Normal
A random variable that has a normal distribution with a mean of zero \(({\mu} = 0)\) and a standard deviation of one \(({\sigma} = 1)\) is said to have a standard normal probability distribution. The z-distribution is given by \({\mathcal {z}}_{({\mu} = 0,\, {\sigma} = 1)}\)
11.1: Sampled-Population
The sampled population is the population from which the sample is drawn.
11.2: Frame
Frame is a list of the elements that the sample will be selected from.
11.3: Target-Population
The target population is the population we want to make inferences about. Generally (and preferably), it will be the same as the ‘Sampled-Population,’ but it may also differ.
11.4: SRS
A simple random sample (SRS) is a set of \({k}\) objects in a population of \({N}\) objects where all possible samples are equally likely to happen. The number of such different simple random samples is \(C_k^N\)
11.5: Sampling-without-Replacement
Sampling without replacement: Once an element has been included in the sample, it is removed from the population and cannot be selected a second time.
11.6: Sampling-with-Replacement
Sampling with replacement: Once an element has been included in the sample, it is returned to the population. A previously selected element can be selected again and therefore may appear in the sample more than once.
11.7: Random-Sample
A random sample of size \({n}\) from an infinite population is a sample selected such that the following two conditions are satisfied. Each element selected comes from the same population. Each element is selected independently. The second condition prevents selection bias.
11.8: Proportion
A population proportion \({P}\) is a parameter that describes a percentage value associated with a population. It is given by \(P = \frac{X}{N}\), where \({X}\) is the count of successes in the population, and \({N}\) is the size of the population. It is estimated through the sample proportion \(\overline{p} = \frac{x}{n}\), where \({x}\) is the count of successes in the sample, and \({n}\) is the size of the sample obtained from the population.
11.9: Point-Estimation
To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. This process is called point estimation.
11.10: Point-Estimator
A sample statistic is the point estimator of the corresponding population parameter. For example, the sample statistics \(\overline{x}, s, s^2, s_{xy}, r_{xy}\) are point estimators for the corresponding population parameters \({\mu}\) (mean), \({\sigma}\) (standard deviation), \(\sigma^2\) (variance), \(\sigma_{xy}\) (covariance), \(\rho_{xy}\) (correlation)
11.11: Point-Estimate
The numerical value obtained for the sample statistic is called the point estimate. Estimate is used for sample value only, for population value it would be parameter. Estimate is a value while Estimator is a function.
11.12: Sampling-Distribution
The sampling distribution of \({\overline{x}}\) is the probability distribution of all possible values of the sample mean \({\overline{x}}\).
11.13: Standard-Error
In general, standard error \(\sigma_{\overline{x}}\) refers to the standard deviation of a point estimator. The standard error of \({\overline{x}}\) is the standard deviation of the sampling distribution of \({\overline{x}}\).
11.14: Sampling-Error
A sampling error is the difference between a population parameter and a sample statistic.
11.15: Central-Limit-Theorem
Central Limit Theorem: In selecting random samples of size \({n}\) from a population, the sampling distribution of the sample mean \({\overline{x}}\) can be approximated by a normal distribution as the sample size becomes large.
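The theorem can be illustrated by simulation: sample means drawn from a skewed (exponential) population still look approximately normal. The seed and sample sizes below are arbitrary choices for reproducibility.

```r
set.seed(42)  # arbitrary seed, for reproducibility

# Draw 5000 samples of size n = 30 from a skewed (exponential) population
xbar <- replicate(5000, mean(rexp(30, rate = 1)))

# The sampling distribution of xbar is approximately normal,
# centred at mu = 1 with standard error sigma / sqrt(n) = 1 / sqrt(30)
c(mean = mean(xbar), se = sd(xbar), theory_se = 1 / sqrt(30))
```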
12.1: Interval-Estimate
Because a point estimator cannot be expected to provide the exact value of the population parameter, an interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate. \(\text{Interval Estimate} = \text{Point Estimate} \pm \text{Margin of Error}\)
12.2: Confidence-Interval
Confidence interval is another name for an interval estimate. It is normally stated at the \((1 - \alpha)\) confidence level, e.g. a 95% confidence interval.
12.3: Confidence-Coefficient
The confidence level expressed as a decimal value is the confidence coefficient (\(1-{\alpha}\)). i.e. 0.95 is the confidence coefficient for a 95% confidence level.
12.4: t-distribution
When \({s}\) is used to estimate \({\sigma}\), the margin of error and the interval estimate for the population mean are based on a probability distribution known as the t distribution.
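A sketch of the t-based interval, computed both by hand with `qt()` and with `t.test()` (hypothetical data):

```r
# 95% t interval for mu when sigma is estimated by s (hypothetical data)
x <- c(80, 70, 55, 75, 65, 85, 95, 70)
n <- length(x)

moe <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)   # margin of error
c(mean(x) - moe, mean(x) + moe)

# t.test() reports the same interval
t.test(x)$conf.int
```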
12.5: Degrees-of-Freedom
The number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary. In general, the degrees of freedom of an estimate of a parameter are \((n - 1)\).
13.1: Hypothesis-Testing
Hypothesis testing is a process in which, using data from a sample, an inference is made about a population parameter or a population probability distribution.
13.2: Hypothesis-Null
Null Hypothesis \((H_0)\) is a tentative assumption about a population parameter. It is assumed True, by default, in the hypothesis testing procedure.
13.3: Hypothesis-Alternative
Alternative Hypothesis \((H_a)\) is the complement of the Null Hypothesis. It is concluded to be True, if the Null Hypothesis is rejected.
13.4: Hypothesis-1T-Lower-Tail
\(\text{\{Left Tail or Lower Tail\} } H_0 : \mu \geq \mu_0 \iff H_a: \mu < \mu_0\)
13.5: Hypothesis-1T-Upper-Tail
\(\text{\{Right Tail or Upper Tail\} } H_0 : \mu \leq \mu_0 \iff H_a: \mu > \mu_0\)
13.6: Hypothesis-2T-Two-Tail
\(\text{\{Two Tail\} } H_0 : \mu = \mu_0 \iff H_a: \mu \neq \mu_0\)
13.7: Error-Type-I
The error of rejecting \({H_0}\) when it is true is a Type I error.
13.8: Error-Type-II
The error of accepting \({H_0}\) when it is false is a Type II error.
13.9: Level-of-Significance
The level of significance \((\alpha)\) is the probability of making a Type I error when the null hypothesis is true as an equality.
13.10: Significance-Tests
Applications of hypothesis testing that only control for the Type I error \((\alpha)\) are called significance tests.
13.11: Test-Statistic
Test statistic is a number calculated from a statistical test of a hypothesis. It shows how closely the observed data match the distribution expected under the null hypothesis of that statistical test. It helps determine whether a null hypothesis should be rejected.
13.12: Tailed-Test
A one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic.
13.13: One-Tailed-Test
One-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.
13.14: Approach-p-value
The p-value approach uses the value of the test statistic \({z}\) to compute a probability called a p-value.
13.15: p-value
A p-value is a probability that provides a measure of the evidence against the null hypothesis provided by the sample. The p-value is used to determine whether the null hypothesis should be rejected. Smaller p-values indicate more evidence against \({H_0}\).
13.16: Approach-Critical-Value
The critical value approach requires that we first determine a value for the test statistic called the critical value.
13.17: Critical-Value
Critical value is the value that is compared with the test statistic to determine whether \({H_0}\) should be rejected. Significance level \({\alpha}\), or confidence level (\(1 - {\alpha}\)), dictates the critical value (\(Z\)), or critical limit. Ex: For Upper Tail Test, \(Z_{{\alpha} = 0.05} = 1.645\).
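Critical values come straight from the quantile functions; for the z statistic, `qnorm()` reproduces the 1.645 quoted above:

```r
alpha <- 0.05

# Upper-tail test: critical z with area alpha to its right
qnorm(alpha, lower.tail = FALSE)       # ~1.645, as in the text

# Two-tailed test: critical values are +/- z_{alpha/2}
qnorm(alpha / 2, lower.tail = FALSE)   # ~1.960
```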
13.18: Acceptance-Region
An acceptance region (confidence interval) is a set of values for the test statistic for which the null hypothesis is accepted. i.e. if the observed test statistic is in the confidence interval then we accept the null hypothesis and reject the alternative hypothesis.
13.19: Margin-Error
The margin of error tells how far the population mean might be from the sample mean. It is given by \(Z\frac{\sigma}{\sqrt{n}}\)
13.20: Rejection-Region
A rejection region (critical region), is a set of values for the test statistic for which the null hypothesis is rejected. i.e. if the observed test statistic is in the critical region then we reject the null hypothesis and accept the alternative hypothesis.
13.21: Two-Tailed-Test
Two-tailed test is a hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.
13.22: Power
The probability of correctly rejecting \({H_0}\) when it is false is called the power of the test. For any particular value of \({\mu}\), the power is \(1 - \beta\).
13.23: Power-Curve
Power Curve is a graph of the probability of rejecting \({H_0}\) for all possible values of the population parameter \({\mu}\) not satisfying the null hypothesis. It provides the probability of correctly rejecting the null hypothesis.
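Power \((1 - \beta)\) can be computed in base R with `power.t.test()`; the effect size, standard deviation, and sample size below are hypothetical:

```r
# Power of a one-sample t test (hypothetical: detect a shift of delta = 2
# with sd = 4 at alpha = 0.05); power rises as n or delta grows
power.t.test(n = 25, delta = 2, sd = 4, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")$power
```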
15.1: Distribution-Chi-Square
Whenever a simple random sample of size \({n}\) is selected from a normal population, the sampling distribution of \(\frac{(n-1)s^2}{{\sigma}^2}\) is a chi-square distribution with \({n - 1}\) degrees of freedom.
15.2: Distribution-F
The F distribution is based on sampling from two normal populations.
18.1: Regression-Analysis
Regression analysis can be used to develop an equation showing how two or more variables are related.
18.2: Variable-Dependent
The variable being predicted is called the dependent variable \(({y})\).
18.3: Variable-Independent
The variable or variables being used to predict the value of the dependent variable are called the independent variables \(({x})\).
18.4: Simple-Linear-Regression
The simplest type of regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line, is called simple linear regression.
18.5: Regression-Model
The equation that describes how \({y}\) is related to \({x}\) and an error term is called the regression model. For example, the simple linear regression model is given by equation (18.1)
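A minimal fit of a simple linear regression model, using the built-in `cars` data as an illustration:

```r
# y = b0 + b1 * x + e, estimated by least squares with lm()
fit <- lm(dist ~ speed, data = cars)
coef(fit)               # estimated intercept b0 and slope b1
summary(fit)$r.squared  # share of variation in y explained by x
```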
22.1: Parametric-Methods
Parametric methods are statistical methods that begin with an assumption about the probability distribution of the population, most often that the population has a normal distribution. A sampling distribution for the test statistic can then be derived and used to make an inference about one or more parameters of the population, such as the population mean \({\mu}\) or the population standard deviation \({\sigma}\).
22.2: Distribution-free-Methods
Distribution-free methods are statistical methods that make no assumptions about the probability distribution of the population.
22.3: Nonparametric-Methods
Nonparametric methods are statistical methods that require no assumption about the form of the probability distribution of the population and are often referred to as distribution-free methods. Several of these methods can be applied with categorical as well as quantitative data.
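As a sketch, a distribution-free counterpart of the two-sample t test is the Wilcoxon rank-sum test, shown here on the built-in `mtcars` data:

```r
# Compares mpg between transmission groups without assuming normality
wilcox.test(mpg ~ am, data = mtcars)
```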
1.1: cannot-open-connection
Error in file(file, ifelse(append, "a", "w")) : cannot open the connection
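This error typically means the target file or folder cannot be written or read; `no_such_dir` below is a hypothetical path:

```r
# Writing into a folder that does not exist raises the error
try(write.csv(mtcars, "no_such_dir/out.csv"))
# Fix: create the directory first (or correct the path / permissions)
dir.create("no_such_dir", showWarnings = FALSE)
write.csv(mtcars, "no_such_dir/out.csv")
```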
1.2: need-finite-xlim
Error in plot.window(...) : need finite 'xlim' values
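This error is raised when `plot()` has no finite values from which to set the axis range:

```r
try(plot(numeric(0)))    # nothing to plot: need finite 'xlim' values
x <- c(NA, 1, 2, NA)
plot(x[is.finite(x)])    # drop non-finite values before plotting
```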
1.3: par-old-par
Error in par(old.par) : invalid value specified for graphical parameter "pin"
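This error often appears when restoring saved graphics parameters after the plot region changed, because the saved set includes computed values such as `pin`. Saving with `no.readonly = TRUE` avoids restoring them:

```r
old.par <- par(no.readonly = TRUE)  # save only parameters that can be reset
# ... plotting code that changes par() settings ...
par(old.par)                        # safe restore
```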
2.1: plot-finite-xlim
Error in plot.window(...) : need finite 'xlim' values
2.2: Function-Not-Found
Error in arrange(bb, day) : could not find function "arrange"
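"could not find function" usually means the package providing it is not attached; `arrange()` lives in dplyr:

```r
# try(arrange(mtcars, mpg))  # errors before the package is attached
library(dplyr)
arrange(mtcars, mpg)  # or call it fully qualified: dplyr::arrange(mtcars, mpg)
```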
3.1: Object-Not-Found-01
Error in match.arg(method) : object 'day' not found
3.2: Comparison-possible
Error in day == 1 : comparison (1) is possible only for atomic and list types
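Both errors above arise when a bare column name such as `day` is used outside a data-masking verb, where it resolves to the lubridate function `day()` rather than a column; the sketch assumes the nycflights13 data:

```r
library(dplyr)
library(lubridate)
library(nycflights13)
# try(day == 1)             # 'day' is a function here, hence the comparison error
filter(flights, day == 1)   # inside filter(), day resolves to the column
```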
3.3: UseMethod-No-applicable-method
Error in UseMethod("select") : no applicable method for 'select' applied to an object of class "function"
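This happens when the object passed to `select()` is a function rather than a data frame, e.g. because the intended data frame was never created or its name shadows a function:

```r
library(dplyr)
# try(select(filter, mpg))  # 'filter' is a function, not a data frame
select(mtcars, mpg)         # pass the data frame itself
```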
3.4: Object-Not-Found-02
Error: Problem with mutate() column ... column object 'arr_delay' not found
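`mutate()` raises this when it references a column the data frame does not contain; `kml` below is a hypothetical new column:

```r
library(dplyr)
# try(mutate(mtcars, gain = arr_delay - dep_delay))  # mtcars has no such columns
mutate(mtcars, kml = mpg * 0.425)                    # reference existing columns
```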
6.1: gg-stat-count-geom-bar
Error: stat_count() can only have an x or y aesthetic.
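`geom_bar()` counts rows itself, so it accepts an x or a y aesthetic but not both; when the bar heights are already in the data, use `geom_col()` instead:

```r
library(ggplot2)
# try(ggplot(mpg, aes(class, hwy)) + geom_bar())  # stat_count() error
ggplot(mpg, aes(class)) + geom_bar()              # counts per class
ggplot(mpg, aes(class, hwy)) + geom_col()         # heights supplied by the data
```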
10.1: ggplot-list
Error in is.finite(x) : default method not implemented for type 'list'
10.2: ggplot-data
Error: Must subset the data pronoun with a string.
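The `.data` pronoun must be subset with a character string, while the `is.finite()` error above typically appears when a list column is mapped to an aesthetic. A sketch, assuming a column name held in a variable:

```r
library(ggplot2)
var <- "hwy"                                     # column name as a string
ggplot(mpg, aes(.data[[var]])) + geom_histogram(bins = 10)
# Subsetting .data with anything but a string (e.g. .data[[2]]) raises the error;
# for the is.finite()/list error, flatten list columns (e.g. with unlist()) first
```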
f_()
## [1] ""
#
difftime(Sys.time(), k_start)
## Time difference of 17.22668 secs
#
if(!identical(0L, length(k_all_chunk_times_lt))) {
print(k_all_chunk_times_lt[order(unlist(k_all_chunk_times_lt), decreasing=TRUE)], digits = 2)
}